Full Table of Contents at end of document.
Chapter Contents
1.1 Purpose
1.2 Audience
1.3 Topics Covered
1.4 Topics Not Covered
1.5 Not Optimized for Smooth Reading
1.6 How D3D11.3 Fits into this Unified Spec
This document describes hardware requirements for Direct3D 11.3 (D3D11.3).
It is assumed that the reader is familiar with real-time graphics, modern Graphics Processing Unit (GPU) design issues and the general architecture of Microsoft Windows Operating Systems, as well as their planned release roadmap.
The target audience for this spec is the implementers, testers and documenters of hardware or software components that would be considered part of a D3D11.3-compliant system. In addition, software developers who are invested in the details of medium-term GPU hardware direction will find interesting information.
Topics covered in this spec center on definition of the hardware architecture being targeted by the D3D11.3 Graphics Pipeline, in a form that attempts to be agnostic to any single vendor's hardware implementation. Included will be some references to how the Graphics Pipeline is controlled through a Device Driver Interface (DDI), and occasionally depictions of API usage as needed to illustrate points.
Occasionally, boxed text such as this appears in the spec to indicate justification for decisions, explain history about a feature, provide clarifications or general remarks about a topic being described, or to flag unresolved issues. These shaded boxes DO NOT provide a complete listing of all such trivia, however. Note that on each revision of this spec, all changes made for that revision are summarized in a separate document typically distributed with the spec.
The exact relationship and interactions between topics covered in the Graphics Pipeline with other Operating System components is not covered.
GPU resource management, GPU process scheduling, and low-level Operating System driver/kernel architecture are not covered.
High-level GPU programming concepts (such as high level shading languages) are not covered.
Little to no theory or derivation of graphics concepts, techniques or history is provided. Equally rare for this spec is any attempt to characterize what sorts of things applications software developers might do using the functionality provided by D3D11.3. There are exceptions, but do not expect to gain much more than an understanding of the "facts" about D3D11.3 from this spec.
Beware, there is little flow to the content in this spec, although there are plenty of links from place to place.
This document is the product of starting with the full D3D11.2 functional spec and adding in relevant WindowsNext D3D11.3 features.
Each Chapter in this spec begins with a summary of the changes from D3D10 to D3D10.1 to D3D11 to D3D11.1 to D3D11.2 to D3D11.3 for that Chapter. A table of links to all of the Chapter delta summaries can be found here(25.2).
To find D3D11.3 changes specifically (which include changes for optional new features as well as clarifications/corrections that affect all feature levels), look for "[D3D11.3]" in the chapter changelists (or simply search the doc for it).
Chapter Contents
2.1 Input Assembler (IA) Overview
2.2 Vertex Shader (VS) Overview
2.3 Hull Shader (HS) Overview
2.4 Tessellator (TS) Overview
2.5 Domain Shader (DS) Overview
2.6 Geometry Shader (GS) Overview
2.7 Stream Output (SO) Overview
2.8 Rasterizer Overview
2.9 Pixel Shader (PS) Overview
2.10 Output Merger (OM) Overview
2.11 Compute Shader (CS) Overview
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
D3D11.3 hardware, like previous generations, can be designed with shared programmable cores. A farm of Shader cores exists on the GPU, able to be scheduled across the functional blocks comprising the D3D11.3 Pipeline, depicted below.
The Input Assembler (IA) introduces triangles, lines, points or Control Points (for Patches) into the graphics Pipeline, by pulling source geometry data out of 1D Buffers(5.3.4).
Vertex data can come from multiple Buffers, accessed in an "Array-of-Structures" fashion from each Buffer. The Buffers are each bound to an individual input slot and given a structure stride. The layout of data across all the Buffers is specified by an Input Declaration, in which each entry defines an "Element" with: an input slot, a structure offset, a data type, and a target register (for the first active Shader in the Pipeline).
A given sequence of vertices is constructed out of data fetched from Buffers, in a traversal directed by a combination of fixed-function state and various Draw*() API/DDI calls. Various primitive topologies are available to make the sequence of vertex data represent a sequence of primitives. Example topologies are: point-list, line-list, triangle-list, triangle-strip, 8 control-point patch-list.
Vertex data can be produced in one of two ways. The first is "Non-Indexed" rendering, which is the sequential traversal of Buffer(s) containing vertex data, originating at a start offset at each Buffer binding. The second method for producing vertex data is "Indexed" rendering, which is sequential traversal of a single Buffer containing scalar integer indices, originating at a start offset into the Buffer. Each index indicates where to fetch data out of Buffer(s) containing vertex data. The index values are independent of the characteristics of the Buffers they are referring to; Buffers are described by a declaration as mentioned earlier. So the task accomplished by "Non-Indexed" and "Indexed" rendering, each in their own way, is producing addresses from which to fetch vertex data in memory, and subsequently assemble the results into vertices and primitives.
Instanced geometry rendering is enabled by allowing the sequential traversal, in either Non-Indexed or Indexed rendering, to loop over a range within each Vertex Buffer (Non-Indexed case) or Index Buffer (Indexed case). Buffer-bindings can be identified as "Instance Data" or "Vertex Data", indicating how to use the bound Buffer while performing instanced rendering. The address generated by "Non-Indexed" or "Indexed" rendering is used to fetch "Vertex Data", accounting also for looping when doing Instanced rendering. "Instance Data", on the other hand, is always sequentially traversed starting from a per-Buffer offset, at a frequency equal to one step per instance (e.g. one step forward after the number of vertices in an instance are traversed). The step rate for "Instance Data" can also be chosen to be a subharmonic of the instance frequency (i.e. one step forward every other instance, every third instance etc.). An illustrative Input Declaration showing these concepts at the API is sketched below.
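For illustration, here is what such an Input Declaration might look like at the API level. This is a hypothetical example (the semantic names, formats and offsets are invented for this sketch): slot 0 carries per-vertex data, and slot 1 carries per-instance data stepped once every second instance.

```cpp
#include <d3d11.h>

// Hypothetical Input Declaration: slot 0 is "Vertex Data" traversed per
// vertex; slot 1 is "Instance Data" stepped once every 2 instances.
static const D3D11_INPUT_ELEMENT_DESC Elements[] =
{
    // semantic   idx  format                         slot  offset  class                            step
    { "POSITION", 0,   DXGI_FORMAT_R32G32B32_FLOAT,    0,    0,     D3D11_INPUT_PER_VERTEX_DATA,     0 },
    { "TEXCOORD", 0,   DXGI_FORMAT_R32G32_FLOAT,       0,   12,     D3D11_INPUT_PER_VERTEX_DATA,     0 },
    { "WORLDPOS", 0,   DXGI_FORMAT_R32G32B32A32_FLOAT, 1,    0,     D3D11_INPUT_PER_INSTANCE_DATA,   2 },
};
```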
Another use of the Input Assembler is that it can read Buffers that were written to from the Stream Output(2.7) stage. Such a scenario necessitates a particular type of Draw, DrawAuto(8.9). DrawAuto enables the Input Assembler to know how much data was dynamically written to a Stream Output Buffer without CPU involvement.
In addition to producing vertex data from Buffers, the IA can auto-generate scalar counter values such as: VertexID(8.16), PrimitiveID(8.17) and InstanceID(8.18), for input to shader stages in the graphics pipeline.
In "Indexed" rendering of strip topologies, such as triangle strips, a mechanism is provided for drawing multiple strips with a single Draw*() call (i.e. 'cut'ting strips).
Specific operational details of the IA are provided here(8).
The Vertex Shader stage processes vertices, performing operations such as transformations, skinning, and lighting. Vertex Shaders always operate on a single input vertex and produce a single output vertex. This stage must always be active.
Specific operational details of Vertex Shaders are provided here(9).
The Hull Shader operates once per Patch (it can only be used with Patches from the IA). It can transform input Control Points that make up a Patch into Output Control Points, and it can perform other setup for the fixed-function Tessellator stage (outputting TessFactors, which are numbers that indicate how much to tessellate).
Specific operational details of the Hull Shader are provided here(10).
The Tessellator is a fixed function unit whose operation is defined by declarations in the Hull Shader. It operates once per Patch output by the Hull Shader. The Hull shader outputs TessFactors which are numbers that tell the Tessellator how much to tessellate (generate geometry and connectivity) over the domain of the Patch.
Specific operational details of the Tessellator are provided here(11).
The Domain Shader is invoked once per vertex generated by the Tessellator. Each invocation is identified by its coordinate on a generic domain, and the role of the Domain Shader is to turn that coordinate into something tangible (such as a point in 3D space) for use downstream. Each Domain Shader invocation for a Patch also sees shared input of all the Hull Shader output (such as output Control Points).
Specific operational details of the Domain Shader are provided here(12).
The Geometry Shader runs application-specified Shader code with vertices as input and the ability to generate vertices on output. The Geometry Shader's inputs are the vertices for a full primitive (two vertices for a line, three vertices for a triangle, a single vertex for a point, or all Control Points for a Patch if it reaches the GS with Tessellation disabled). Some types of primitives can also include the vertices of edge-adjacent primitives (an additional two vertices for a line, an additional three for a triangle).
Another input is a PrimitiveID auto-generated by the IA. This allows per-face data to be fetched or computed if desired.
The Geometry Shader stage is capable of outputting multiple vertices forming a single selected topology (GS output topologies available are: tristrip, linestrip, pointlist). The number of primitives emitted can vary freely within any invocation of the Geometry Shader, though the maximum number of vertices that could be emitted must be declared statically. Strip lengths emitted from a GS invocation can be arbitrary (there is a 'cut'(22.8.1) command).
Output may be fed to the rasterizer and/or out to vertex Buffers in memory. Output fed to memory is expanded to individual point/line/triangle lists (the same way they would get passed to the rasterizer).
Algorithms that can be implemented in the Geometry Shader include: point sprite expansion, dynamic particle systems, fur/fin generation, shadow volume generation, single-pass render-to-cubemap, per-primitive material swapping, and per-primitive material setup (such as generation of barycentric coordinates as primitive data so that a Pixel Shader can perform custom attribute interpolation).
Specific operational details of the Geometry Shader are provided here(13).
Vertices may be streamed out to memory just before arriving at the Rasterizer. This is like a "tap" in the Pipeline, which can be turned on even as data continues to flow down to the Rasterizer. Data sent out via Stream Output is concatenated to Buffer(s). These Buffers may on subsequent passes be recirculated as Pipeline inputs.
One constraint about Stream Output is that it is tied to the Geometry Shader, in that both must be created together (though either can be "NULL"/"off"). The particular memory Buffer(s) being Streamed out are not tied to this GS/SO pair though. Only the description of which parts of vertex data to feed to Stream Output is tied to the GS.
One use for Stream Output is for saving ordered Pipeline data that will be reused. For example a batch of vertices might be "skinned" by passing the vertices into the Pipeline as if they are independent points (just to visit all of them once), applying "skinning" operations on each vertex, and streaming out the results to memory. The saved out "skinned" vertices are now available for use in subsequent passes as input.
Since the amount of output written through Stream Output can be unpredictably dynamic, a special type of Draw command, DrawAuto(8.9), is necessary. DrawAuto enables the Input Assembler to know how much data was dynamically written to a Stream Output Buffer without CPU involvement. In addition, Queries are necessary to mitigate Stream Output overflow(20.4.10), as well as retrieve how much data was written(20.4.9) to the Stream Output Buffers.
Specific operational details of the Stream Output are provided here(14).
The rasterizer is responsible for clipping, primitive setup, and determining how to invoke Pixel Shaders. D3D11.3 does not view this as a "stage" in the Pipeline, but rather an interface between Pipeline stages which happens to perform a significant set of fixed function operations, many of which can be adjusted by software developers.
The rasterizer always assumes input positions are provided in clip-space, performs clipping, perspective divide and applies viewport scale/offset.
Specific operational details of the Rasterizer are provided here(15).
Input data available to the Pixel Shader includes vertex attributes that can be chosen, on a per-Element basis, to be interpolated with or without perspective correction, or be treated as constant per-primitive.
The Pixel Shader can be invoked either once per pixel or once per covered sample within the pixel.
Outputs are one or more 4-vectors of output data for the current pixel or sample, or no color (if pixel is discarded).
The Pixel Shader has some other inputs and outputs available as well, similar to the kind of inputs and outputs the Compute Shader can use, allowing, for instance, the ability to write to scattered locations.
Specific operational details of Pixel Shaders are provided here(16).
The final step in the logical Pipeline is visibility determination, through stencil or depth, and writing or blending of output(s) to RenderTarget(s), which may be one of many Resource Types(5).
These operations, as well as the binding of output resources (RenderTargets), are defined at the Output Merger. Specific operational details of the Output Merger are provided here(17).
The Compute Shader allows the GPU to be viewed as a generic grid of data-parallel processors, without any graphics baggage from the graphics pipeline. The Compute Shader has explicit access to fast shared memory to facilitate communication between groups of shader invocations, and the ability to perform scattered reads and writes to memory. The availability of atomic operations enables unique access to shared memory addresses. The Compute Shader is not part of the Graphics Pipeline (all the previously discussed shader stages). The Compute Shader exists on its own, albeit on the same device as all the other Shader Stages. To invoke this shader, Dispatch*() APIs are called instead of Draw*().
Specific operational details of Compute Shaders are provided here(18).
Chapter Contents
3.1 Floating Point Rules
3.2 Data Conversion
3.3 Coordinate Systems
3.4 Rasterization Rules
3.5 Multisampling
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
Section Contents
3.1.1 Overview
3.1.2 Term: Unit-Last-Place (ULP)
3.1.3 32-bit Floating Point
D3D11 supports several different floating point representations for storage. However, all floating point computations in D3D11, whether in Shader programs written by application developers or in fixed function operations such as texture filtering or RenderTarget blending, are required to operate under a defined subset of the IEEE 754 32-bit single precision floating point behavior.
One ULP is the smallest representable delta from one value in a numeric representation to an adjacent value. For a floating point number, the absolute magnitude of this delta varies with the magnitude of the number. If, hypothetically, the result of an arithmetic operation were allowed a tolerance of 1 ULP from the infinitely precise result, this would allow an implementation that always truncated its result (without rounding), producing an error of at most one unit in the last (least significant) place in the number representation. On the other hand, it is much more desirable to require 0.5 ULP tolerance on arithmetic results, since that requires the result be the closest possible representation to the infinitely precise result, using round to nearest-even.
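As a quick illustration of the definition (a sketch, not a spec requirement), the size of 1 ULP can be computed directly and observed to scale with the magnitude of the value:

```cpp
#include <cmath>

// One ULP is the distance from a value to the next representable value;
// its absolute size grows with the magnitude of the number.
float ulp(float v)
{
    return std::nextafterf(v, INFINITY) - v;
}
// ulp(1.0f)    == 2^-23 (~1.19e-7)
// ulp(1024.0f) == 2^-13 (~1.22e-4)
```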
Here is a summary of expected 32-bit floating point behaviors for D3D11. Some of these points choose a single option in cases where IEEE-754 offers choices. This is followed by a listing of deviations or additions to IEEE-754 (some of which are significant). Refer to IEEE-754 for topics not mentioned.
The IEEE-754R specification for floating point min and max operations states that if one of the inputs to min or max is a "quiet" NaN, then the result of the operation is the other parameter. For example:
min(x,QNaN) == min(QNaN,x) == x (same for max)
A recent revision of the IEEE-754R specification seems to have adopted a different behavior for min and max when one input is a "signaling" NaN (SNaN) rather than a QNaN:
min(x,SNaN) == min(SNaN,x) == QNaN (same for max)
This latter change was not in place until after D3D10 had shipped, and even after the D3D11 specifications had become fairly mature and locked down. So, even though the intent in general for D3D is to follow the standards for arithmetic, IEEE-754 and IEEE-754R, in this case there is a deviation. Future D3D versions may consider relaxing the rules to allow either behavior, although compatibility will be a concern, in addition to having to justify the value of distinguishing QNaN vs SNaN in general. As for D3D11, it cannot change behavior here at this point, so it matches D3D10 as follows:
The arithmetic rules in D3D10+ do not make any distinctions between "quiet" and "signaling" NaN values (QNaN vs SNaN). All "NaN" values are handled the same way. In the case of min() and max(), the D3D behavior for any NaN value is like how QNaN is handled in IEEE-754R definition above. (For completeness - if both inputs are NaN, any NaN value is returned.)
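For illustration, a minimal sketch of this rule (not any particular implementation's code):

```cpp
#include <cmath>

// D3D10+ min rule: any NaN input yields the other operand; if both inputs
// are NaN, some NaN comes back. SNaN and QNaN are not distinguished.
float d3d_min(float a, float b)
{
    if (std::isnan(a)) return b;   // NaN vs x -> x (if b is also NaN, a NaN is returned)
    if (std::isnan(b)) return a;
    return a < b ? a : b;          // ordinary comparison otherwise
}
```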
Double-precision floating-point support is optional, however all double-precision floating point instructions listed in this spec (here (arithmetic)(22.14), here (conditional)(22.15), here (move)(22.16) and here (type conversion)(22.17) ) must be implemented if double support is enabled.
Double-precision floating-point usage is indicated at compile time by declaring shader model 5_a. Support for Shader Model 5.0a will be reportable by drivers and discoverable by users via an API.
When supported, double-precision instructions match IEEE 754R behavior requirements (with the exception of double precision reciprocal(22.14.5) which is permitted 1.0 ULP tolerance and the exact result if representable).
An exception to the 4-vector register convention exists for double-precision floating-point instructions, which operate on pairs of doubles. Double-precision floating-point values are in IEEE 754R format. One double is stored in .xy with the least significant 32 bits in x, and the most significant 32 bits in y. Similarly the second double is stored in .zw with the least significant 32 bits in z, and the most significant 32 bits in w.
The permissible swizzles for double operations are .xyzw, .xyxy, .zwxy, .zwzw. The permissible write masks for double operations are .xy, .zw, and .xyzw.
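The storage convention can be sketched at the bit level as follows (assuming the host double is in IEEE 754 64-bit format; this illustrates the register packing only, not DDI code):

```cpp
#include <cstdint>
#include <cstring>

// One double occupies two 32-bit register components: least significant
// half in .x, most significant half in .y (likewise .z/.w for the second).
void PackDoubleXY(double d, uint32_t& x, uint32_t& y)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));
    x = static_cast<uint32_t>(bits);         // low 32 bits  -> .x
    y = static_cast<uint32_t>(bits >> 32);   // high 32 bits -> .y
}
```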
Support for generation of denormalized values is required for double-precision data (no flush-to-zero behavior). Likewise, instructions do not read denormalized data as a signed zero - they honor the denorm value.
Several resource formats in D3D11 contain 16-bit representations of floating point numbers. This section describes the float16 representation.
Format: 1 sign bit [15], 5 bits of biased exponent [14:10] (bias 15), and 10 bits of fraction [9:0].
A float16 value, v, made from the format above takes the following meaning (S = sign, E = biased exponent, M = fraction):

- if E == 31 and M != 0, then v is NaN regardless of S
- if E == 31 and M == 0, then v = (-1)^S * INF
- if 0 < E < 31, then v = (-1)^S * 2^(E-15) * (1.M) (normalized)
- if E == 0 and M != 0, then v = (-1)^S * 2^(-14) * (0.M) (denormalized)
- if E == 0 and M == 0, then v = (-1)^S * 0 (signed zero)
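A reference-style decode following these cases might look like the sketch below (assuming the bit layout above; illustrative only):

```cpp
#include <cmath>
#include <cstdint>

// Sketch: float16 -> float32 decode (S = bit 15, E = bits 14:10, M = bits 9:0).
float Float16ToFloat32(uint16_t h)
{
    int   e    = (h >> 10) & 0x1F;
    int   m    =  h        & 0x3FF;
    float sign = (h & 0x8000) ? -1.0f : 1.0f;

    if (e == 31) return m ? NAN : sign * INFINITY;            // NaN / infinity
    if (e == 0)  return sign * std::ldexp(m / 1024.0f, -14);  // denorm or signed zero
    return sign * std::ldexp(1.0f + m / 1024.0f, e - 15);     // normalized
}
```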
32-bit floating point rules also hold for 16-bit floating point numbers, adjusted for the bit layout described above.
The exceptions are:
A single resource format in D3D11 contains 11-bit and 10-bit representations of floating point numbers. This section describes the float11 and float10 representations.
Format: float11 has 5 bits of biased exponent [10:6] (bias 15) and 6 bits of fraction [5:0]; float10 has 5 bits of biased exponent [9:5] (bias 15) and 5 bits of fraction [4:0]. Neither has a sign bit.
A float11/float10 value, v, made from the format above takes the same meaning as float16 (with the narrower fraction width), except that with no sign bit, only non-negative values, +INF and NaN are representable.
32-bit floating point rules also hold for 11-bit and 10-bit floating point numbers, adjusted for the bit layout described above.
The exceptions are:
Section Contents
3.2.1 Overview
3.2.2 Floating Point Conversion
3.2.3 Integer Conversion
This section describes the rules for various data conversions in D3D11. Other relevant information regarding data conversion is in the Data Invertability(19.1.2) section.
Whenever a floating point conversion between different representations occurs, including to/from non-floating point representations, the following rules apply.
These are rules for converting from a higher range representation to a lower range representation:
These are rules for converting from a lower precision/range representation to a higher precision/range representation:
The following set of terms are subsequently used to characterize various integer format conversions.
Term | Definition |
---|---|
SNORM | Signed normalized integer, meaning that for an n-bit 2's complement number, the maximum value means 1.0f (e.g. the 5-bit value 01111 maps to 1.0f), and the minimum value means -1.0f (e.g. the 5-bit value 10000 maps to -1.0f). In addition, the second-minimum number maps to -1.0f (e.g. the 5-bit value 10001 maps to -1.0f). There are thus two integer representations for -1.0f. There is a single representation for 0.0f, and a single representation for 1.0f. This results in a set of integer representations for evenly spaced floating point values in the range (-1.0f...0.0f), and also a complementary set of representations for numbers in the range (0.0f...1.0f) |
UNORM | Unsigned normalized integer, meaning that for an n-bit number, all 0's means 0.0f, and all 1's means 1.0f. A sequence of evenly spaced floating point values from 0.0f to 1.0f are represented. e.g. a 2-bit UNORM represents 0.0f, 1/3, 2/3, and 1.0f. |
SINT | Signed integer. 2's complement integer. e.g. an 3-bit SINT represents the integral values -4, -3, -2, -1, 0, 1, 2, 3. |
UINT | Unsigned integer. e.g. a 3-bit UINT represents the integral values 0, 1, 2, 3, 4, 5, 6, 7 |
FLOAT | A floating-point value in any of the representations defined by D3D11. |
SRGB | Similar to UNORM, in that for an n-bit number, all 0's means 0.0f and all 1's means 1.0f. However unlike UNORM, with SRGB the sequence of unsigned integer encodings between all 0's and all 1's represents a nonlinear progression in the floating point interpretation of the numbers, from 0.0f to 1.0f. Roughly, if this nonlinear progression, SRGB, is displayed as a sequence of colors, it would appear as a linear ramp of luminosity levels to an "average" observer, under "average" viewing conditions, on an "average" display. For complete detail, refer to the SRGB color standard, IEC 61966-2-1, at IEC (International Electrotechnical Commission) |
Note that the terms above are also used as Format Name Modifiers(19.1.3.2), where they describe both how data is laid out in memory and what conversion to perform in the transport path (potentially including filtering) from memory to/from a Pipeline unit such as a Shader. See the Formats(19.1) section to see exactly how these names are used in the context of resource formats.
What follows are descriptions of conversions from various representations described above to other representations. Not all permutations are shown, but at least all the ones that show up in D3D11 somewhere are shown.
Unless otherwise specified for specific cases, all conversions to/from integer representations to float representations described below must be done exactly. Where float arithmetic is involved, FULL IEEE-754 precision is required (1/2 ULP(3.1.2) of the infinitely precise result), which is stricter than the general D3D11 Floating Point Rules(3.1).
Given an n-bit integer value representing the signed range [-1.0f to 1.0f], conversion to floating-point is as follows:
Given a floating-point number, conversion to an n-bit integer value representing the signed range [-1.0f to 1.0f] is as follows:
This conversion is permitted a tolerance of 0.6f ULP(3.1.2) (on the integer side). This means that after converting from float to integer scale, any value within 0.6f ULP(3.1.2) of a representable target format value is permitted to map to that value. The additional Data Invertability(19.1.2) requirement ensures that the conversion is nondecreasing across the range and that all output values are attainable.
Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.
This conversion is permitted a tolerance of 0.6f ULP(3.1.2) (on the integer side). This means that after converting from float to integer scale, any value within 0.6f ULP(3.1.2) of a representable target format value is permitted to map to that value. The additional Data Invertability(19.1.2) requirement ensures that the conversion is nondecreasing across the range and that all output values are attainable.
Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.
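For illustration, a minimal sketch of the n-bit SNORM and UNORM conversions under the standard D3D11 rules (scale factor 2^(n-1)-1 for SNORM with both minimum codes mapping to -1.0f, 2^n-1 for UNORM, NaN converting to 0); this shows reference behavior, before the tolerances above are applied:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Reference-style n-bit SNORM <-> float32 conversions.
float snorm_to_float(int32_t c, int n)            // c: sign-extended n-bit code
{
    float scale = float((1 << (n - 1)) - 1);      // e.g. 127 for n == 8
    return std::max(float(c) / scale, -1.0f);     // both minimum codes yield -1.0f
}

int32_t float_to_snorm(float v, int n)
{
    if (std::isnan(v)) return 0;                  // NaN converts to 0
    float scale = float((1 << (n - 1)) - 1);
    v = std::min(std::max(v, -1.0f), 1.0f);       // clamp to representable range
    return int32_t(std::lrintf(v * scale));       // round to nearest (even on ties)
}

// The UNORM analog uses scale 2^n - 1 over [0.0f, 1.0f]:
float unorm_to_float(uint32_t c, int n)
{
    return float(c) / float((1u << n) - 1);
}
```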
The following is the ideal SRGB to FLOAT conversion.
This conversion will be permitted a tolerance of 0.5f ULP(3.1.2) (on the SRGB side). The procedure for measuring this tolerance, given that it is relative to the SRGB side even though the result is a FLOAT, is to convert the result back into SRGB space using the ideal FLOAT -> SRGB conversion specified below, but WITHOUT the rounding to integer, and taking the floating point difference versus the original SRGB value to yield the error. There are a couple of exceptions to this tolerance, where exact conversion is required: 0.0f and 1.0f (the ends) must be exactly achievable.
The following is the ideal FLOAT -> SRGB conversion.
Assuming the target SRGB color component has n bits:
This conversion is permitted a tolerance of 0.6f ULP(3.1.2) (on the integer side). This means that after converting from float to integer scale, any value within 0.6f ULP(3.1.2) of a representable target format value is permitted to map to that value. The additional Data Invertability(19.1.2) requirement ensures that the conversion is nondecreasing across the range and that all output values are attainable.
Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.
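For illustration, a sketch of the ideal FLOAT -> SRGB encoding for an n-bit component, using the IEC 61966-2-1 transfer function (reference behavior; hardware has the 0.6f ULP tolerance described above):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Ideal FLOAT -> n-bit SRGB: clamp, apply the sRGB transfer function,
// then scale to integer and round.
uint32_t float_to_srgb(float v, int n)
{
    if (std::isnan(v)) v = 0.0f;
    v = std::min(std::max(v, 0.0f), 1.0f);            // clamp to [0.0f, 1.0f]
    float s = (v <= 0.0031308f)
        ? 12.92f * v                                  // linear segment near zero
        : 1.055f * std::pow(v, 1.0f / 2.4f) - 0.055f; // gamma segment
    return uint32_t(std::lrintf(s * float((1u << n) - 1))); // scale and round
}
```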
To convert from SINT to an SINT with more bits, the MSB bit of the starting number is "sign-extended" to the additional bits available in the target format.
To convert from UINT to an SINT with more bits, the number is copied to the target format's LSBs and additional MSB's are padded with 0.
To convert from SINT to UINT with more bits: If negative, the value is clamped to 0. Otherwise the number is copied to the target format's LSBs and additional MSB's are padded with 0.
To convert from UINT to UINT with more bits the number is copied to the target format's LSBs and additional MSB's are padded with 0.
To convert from a SINT or UINT to SINT or UINT with fewer or equal bits (and/or change in signedness), the starting value is simply clamped to the range of the target format.
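These rules can be sketched as follows (the 4-bit and 8-bit widths are chosen arbitrarily for illustration):

```cpp
#include <algorithm>
#include <cstdint>

int8_t sint4_to_sint8(int8_t v4)   // widen SINT: replicate (sign-extend) the MSB
{
    return int8_t(v4 << 4) >> 4;   // v4 holds a 4-bit SINT in its low bits
}

int8_t uint4_to_sint8(uint8_t v4)  // widen from UINT: zero-pad the new MSBs
{
    return int8_t(v4 & 0xF);
}

int8_t sint8_to_sint4(int8_t v8)   // narrow: clamp to the 4-bit SINT range
{
    return int8_t(std::min(std::max(int(v8), -8), 7));
}

uint8_t sint8_to_uint4(int8_t v8)  // signedness change: clamp to [0, 15]
{
    return uint8_t(std::min(std::max(int(v8), 0), 15));
}
```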
Fixed point integers are simply integers of some bit size that have an implicit decimal point at a fixed location. The ubiquitous "integer" data type is a special case of a fixed point integer with the decimal at the end of the number. Fixed point number representations are characterized as: i.f, where i is the number of integer bits and f is the number of fractional bits. e.g. 16.8 means 16 bits integer followed by 8 bits of fraction. The integer part is stored in 2's complement, at least as defined here (though it can be defined equally for unsigned integers as well). The fractional part is stored in unsigned form. The fractional part always represents the positive fraction between the two nearest integral values, starting from the most negative. Exact details of fixed point representation, and mechanics of conversion from floating point numbers are provided below.
Addition and subtraction operations on fixed point numbers are performed simply using standard integer arithmetic, without any consideration for where the implied decimal lies. Adding 1 to a 16.8 fixed point number just means adding 256, since the decimal is 8 places in from the least significant end of the number. Other operations such as multiplication, can be performed as well simply using integer arithmetic, provided the effect on the fixed decimal is accounted for. For example, multiplying two 16.8 integers using an integer multiply produces a 32.16 result.
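A sketch of this, using the 16.8 layout as an example:

```cpp
#include <cstdint>

// 16.8 fixed point arithmetic as described above: addition is plain integer
// addition, and multiplying two 16.8 values yields a 32.16 result.
typedef int32_t fixed16_8;                 // 16 integer bits . 8 fraction bits

fixed16_8 add_one(fixed16_8 a)
{
    return a + (1 << 8);                   // "+1.0" is +256: the point sits 8 bits up
}

int64_t mul_16_8(fixed16_8 a, fixed16_8 b) // result is in 32.16 form
{
    return int64_t(a) * int64_t(b);
}
```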
Fixed point integer representations are used in a couple of places in D3D11:
The following is the general procedure for converting a floating point number n to a fixed point integer i.f, where i is the number of (signed) integer bits and f is the number of fractional bits:
Note: Sign of zero is preserved.
For D3D11, implementations are permitted a 0.6f ULP(3.1.2) tolerance in the integer result vs. the infinitely precise value n*2^f after the last step above.
The diagram below depicts the ideal/reference float to fixed conversion (including round-to-nearest-even), yielding 1/2 ULP accuracy to an infinitely precise result, which is more accurate than required by the tolerance defined above. Future D3D versions will require exact conversion like this reference.
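A sketch of that reference conversion (clamp to the representable i.f range, scale by 2^f, round to nearest-even; the NaN-to-zero line is an assumption of this sketch, not quoted from the rules above):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Reference-style float -> i.f fixed point conversion.
int32_t float_to_fixed(float n, int i, int f)
{
    if (std::isnan(n)) return 0;                                // assumption: NaN -> 0
    double lo = -std::ldexp(1.0, i - 1);                        // most negative i.f value
    double hi =  std::ldexp(1.0, i - 1) - std::ldexp(1.0, -f);  // most positive i.f value
    double v  = std::min(std::max(double(n), lo), hi);          // clamp to range
    return int32_t(std::lrint(v * std::ldexp(1.0, f)));         // scale by 2^f, round to nearest-even
}
```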
Specific choices of bit allocations for fixed point integers are listed in the places in the D3D11 spec where they are used.
Assume that the specific fixed point representation being converted to float does not contain more than a total of 24 bits of information, no more than 23 bits of which is in the fractional component. Suppose a given fixed point number, fxp, is in i.f form (i bits integer, f bits fraction). The conversion to float is akin to the following pseudocode:
float result = (float)(fxp >> f) +                               // extract integer
               ((float)(fxp & ((1 << f) - 1)) / (float)(1 << f)); // extract fraction
Although the situation rarely, if ever, arises, consider a number that originates as fixed point, gets converted to float32, and then gets converted back to fixed point: it will remain identical to its original value. This holds provided that the bit representation for the fixed point number does not contain more information than can be represented in a float32. This lossless conversion property does not hold when making the opposite round-trip, starting from float32, moving to fixed-point, and back; indeed lossy conversion is in fact the "point" of converting from float32 to fixed-point in the first place.
One final note on round-trip conversion. Observe that when the float32 number -2.75 is converted to fixed-point, it becomes -3 + 0.25; that is, the integer part is negative but the fractional part, considered by itself, is positive. When that is converted back to float32, it becomes -2.75 again, since floating point stores negative numbers in sign-magnitude form, instead of in two's complement form.
Section Contents
3.3.1 Pixel Coordinate System
3.3.2 Texel Coordinate System
3.3.3 Texture Coordinate Interpretation
The Pixel Coordinate System defines the origin as the upper-left corner of the RenderTarget. Pixel centers are therefore offset by (0.5f,0.5f) from integer locations on the RenderTarget. This choice of origin makes rendering screen-aligned textures trivial, as the pixel coordinate system is aligned with the texel coordinate system.
D3D9 and prior had a terrible Pixel Coordinate System where the origin was the center of the top left pixel on the RenderTarget. In other words, the origin was (0.5,0.5) away from the upper left corner of the RenderTarget. There was the nice property that Pixel centers were at integer locations, but the fact that this was misaligned with the texture coordinate system frequently burned unsuspecting developers. Further, with Multisample rendering, there was a 1/2 pixel wide region of the RenderTarget along the top and left edge that the viewport could not cover. D3D11 allows applications that want to emulate this behavior to specify a fractional offset to the top left corner of the viewport (-0.5,-0.5).
The texel coordinate system has its origin at the top-left corner of the texture. See the "Texel Coordinate System" diagram below. This is consistent with the Pixel Coordinate System.
Memory load instructions such as sample(22.4.15) or ld(22.4.6) interpret texture coordinates in a couple of ways (normalized float or scaled integer, respectively). The "Texture Coordinate Interpretation" diagram below describes how these interpretations get mapped to specific texel(s), for point and linear sampling. The diagram does not illustrate address wrapping, which occurs after the shown equations for computing texel locations. The addressing math shown in this diagram is only a general guideline; the exact definition of texel selection arithmetic is provided in the Texture Sampling(7.18) section, including the role of Fixed Point(3.2.4.1) snapping of precision in the addressing process.
Section Contents
3.4.1 Coordinate Snapping
3.4.2 Triangle Rasterization Rules
Consider a set of vertices going through the Rasterizer, after having gone through clipping, perspective divide and viewport scale. Suppose that any further primitive expansion has been done (e.g. rectangular lines can be drawn by implementations as 2 triangles, described later). After the final primitives to be rasterized have been obtained, the x and y positions of the vertices are snapped to exactly n.8 fixed point integers. Any front/back culling is applied (if applicable) after vertices have been snapped. Interpolation of pixel attributes is set up based on the snapped vertex positions of primitives being rasterized.
Any pixel sample locations which fall inside the triangle are drawn. An example with a single sample per pixel (at the center) is shown below. If a sample location falls exactly on the edge of the triangle, the Top-Left Rule applies, to ensure that adjacent triangles do not overdraw. The Top-Left rule is described below.
Top edge: If an edge is exactly horizontal, and it is above the other edges of the triangle in pixel space, then it is a "top" edge.
Left edge: If an edge is not exactly horizontal, and it is on the left side of the triangle in pixel space, then it is a "left" edge. A triangle can have one or two left edges.
Top-Left Rule: If a sample location falls exactly on the edge of a triangle, the sample is inside the triangle if the edge is a "top" edge or a "left" edge. If two edges from the same triangle touch the pixel center, then if both edges are "top" or "left" then the sample is inside the triangle.
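A minimal sketch of coverage testing with the Top-Left Rule, assuming screen space with y increasing downward and clockwise vertex order (illustrative only; real hardware evaluates this on snapped fixed point(3.4.1) positions):

```cpp
// Edge function: > 0 means strictly inside the edge, == 0 means exactly on it.
struct Vtx { float x, y; };

static float Edge(Vtx a, Vtx b, float px, float py)
{
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

// For clockwise winding with y-down: a "top" edge runs left to right along a
// horizontal; a "left" edge winds upward (decreasing y).
static bool IsTopLeft(Vtx a, Vtx b)
{
    bool top  = (a.y == b.y) && (b.x > a.x);
    bool left = (b.y < a.y);
    return top || left;
}

bool SampleCovered(Vtx v0, Vtx v1, Vtx v2, float px, float py)
{
    Vtx e[3][2] = { { v0, v1 }, { v1, v2 }, { v2, v0 } };
    for (auto& ab : e)
    {
        float d = Edge(ab[0], ab[1], px, py);
        if (d < 0.0f) return false;                        // outside this edge
        if (d == 0.0f && !IsTopLeft(ab[0], ab[1]))         // exactly on the edge:
            return false;                                  // only top/left edges count
    }
    return true;
}
```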
Rasterization rules for infinitely-thin lines, with no antialiasing, are described below.
One further implication of these line rasterization rules is that lines that are geometrically clipped to the viewport extent may set one less pixel than lines that are rendered to a larger 2D extent with the pixels outside the viewport discarded. (This is due to the handling of the line endpoints.)
Since geometric clip to the viewport is neither required nor disallowed, aliased line rendering is allowed to differ in viewport-edge pixels due to geometric clipping.
The alpha-based antialiased rasterization of a line (defined by two end vertices) is implemented as the visualization of a rectangle, with the line's two vertices centered on two opposite "ends" of the rectangle, and the other two edges separated by a width (in D3D11 width is only 1.0f). No accounting for connected line segments is done. The region of intersection of this rectangle with the RenderTarget is estimated by some algorithm, producing "Coverage" values [0.0f..1.0f] for each pixel in a region around the line. The Coverage values are multiplied into the Pixel Shader output o0.a value before the Output Merger Stage. Undefined results are produced if the PS does not output o0.a. D3D11 exposes no controls for this line mode.
It is deemed that there is no single "best" way to perform alpha-based antialiased line rendering. D3D11 adopts as a guideline the method shown in the diagram below. This method was derived empirically, exhibiting a number of visual properties deemed desirable. Hardware need not exactly match this algorithm; tests against this reference shall have "reasonable" tolerances, guided by some of the principles listed further below, permitting various hardware implementations and filter kernel sizes. None of this flexibility permitted in hardware implementation, however, can be communicated up through D3D11 to applications, beyond simply drawing lines and observing/measuring how they look.
The following is a listing of the "nice" properties that fall out of the above algorithm, which in general will be expected of hardware implementations (admittedly many of which are likely difficult to test):
Note that the wider the filter kernel an implementation uses, the blurrier the line, and thus the more sensitive the resulting perceived line intensity is to display gamma. The reference implementation's kernel is quite large, at 3x3 pixel units about each pixel.
Quadrilateral lines take 2 endpoints and turn them into a simple rectangle with width 1.4f, drawn with triangles. The attributes at each end of the line are duplicated for the 2 vertices at each end of the rectangle.
This mode is not supported with center sample patterns (D3D11_CENTER_MULTISAMPLE_PATTERN) where there is more than one sample overlapping the center of the pixel, in which case results of drawing this style of line are undefined. See here(19.2.4.1).
For the purpose of rasterization, a point is represented as a square of width 1 oriented to the RenderTarget. Actual implementations may vary, but output behavior should be identical to what is described here. The coordinate for a point identifies where the center of the square is located. Pixel coverage for points follows Triangle Rasterization Rules, interpreted as though a point is composed of 2 triangles in a Z pattern, with attributes duplicated at the 4 vertices. Cull modes do not apply to points.
Section Contents
3.5.1 Overview
3.5.2 Warning about the MultisampleEnable State
3.5.3 Multisample Sample Locations And Reconstruction
3.5.4 Effects of Sample Count > 1
Multisample Antialiasing seeks to fight geometry aliasing, without necessarily dealing with surface aliasing (leaving that as a shading problem, e.g. texture filtering). This is accomplished by performing pixel coverage tests and depth/stencil tests at multiple sample locations per pixel, backed by storage for each sample, while only performing pixel shading calculations once for covered pixels (broadcasting Pixel Shader output across covered samples). It is also possible to request Pixel Shader invocations to occur at sample-frequency rather than at pixel-frequency.
The MultisampleEnable Rasterizer State remains as an awkward leftover from D3D9. It no longer does what the name implies; it no longer has any bearing on multisampling; it only controls line rendering behavior now. The state should have been renamed/refactored, but the opportunity was missed in D3D11. For a detailed discussion about what this state actually does now, see State Interaction With Point/Line/Triangle Rasterization Behavior(15.14).
Specifics about sample locations and reconstruction functions for multisample antialiasing are dependent on the chosen Multisample mode, which is outside the scope of this section. See Multisample Format Support(19.2), and Specification of Sample Positions(19.2.4).
Rasterization behavior when sample count is greater than 1 is simply that primitive coverage tests are done for each sample location within a pixel. If one or more sample locations in a pixel are covered, the Pixel Shader is run once for the pixel in Pixel-Frequency mode, or in Sample-Frequency mode once for each covered sample that is also in the Rasterizer SampleMask. Pixel-frequency execution produces a single set of Pixel Shader output data that is replicated to all covered samples that pass their individual depth/stencil tests and blended to the RenderTarget per-sample. Sample-frequency execution produces a unique set of Pixel Shader output data per covered sample (and in SampleMask), each output getting blended 1:1 to the corresponding RenderTarget sample if its depth/stencil test passes.
Note that points(3.4.6) and quadrilateral lines(3.4.5) are functionally equivalent to drawing their area with triangles. So Sample-Frequency execution is easily defined for all of these primitives. For points, the samples covered by the point area (and in the RasterizerState's SampleMask) each get Pixel Shader invocations with attributes replicated from its single vertex (except one parameter is available that is varying - an ID identifying each sample from the total set of samples in the pixel). For quadrilateral lines, the two end vertices define how attributes interpolate along the length, staying constant across the perpendicular. Again, the samples covered by the area of the primitive (and in the SampleMask) each get a Pixel Shader invocation in Sample-Frequency execution mode, with unique input attributes per sample, including an ID identifying which sample it is.
Alpha-Antialiased Lines(3.4.4) and Aliased Lines(3.4.3) are algorithms that inherently do not deal with discrete sample locations within a pixel's area, and thus it is illegal/undefined to request Sample-Frequency execution for these primitives, unless the sample count is 1, which is identical to Pixel-Frequency execution.
Consider a Pixel Shader that operates only on pixel-frequency inputs (e.g. all attributes have one of the following interpolation modes(16.4): constant, linear, linear_centroid, linear_noperspective or linear_noperspective_centroid). Implementations need only execute the shader once per pixel and replicate the results to all samples in the pixel. Now suppose code is added to the shader that generates new outputs based on reading sample-frequency inputs. The existing pixel-frequency part of the shader behaves identically to before. Even though the shader will now execute at sample-frequency (so the new outputs can vary per-sample), each invocation produces the same result for the original outputs.
Though this example happens to separate out the different interpolation frequencies to highlight their invariance, of course it is perfectly valid in general for shader code to mix together inputs with any different interpolation modes.
When a sample-frequency interpolation mode(16.4) is not needed on an attribute, pixel-frequency interpolation modes such as linear evaluate at the pixel center. However with sample count > 1 on the RenderTarget, attributes could be interpolated at the pixel center even though the center of the pixel may not be covered by the primitive, in which case interpolation becomes "extrapolation". This "extrapolation" can be undesirable in some cases, so short of going to sample-frequency interpolation, a compromise is the centroid interpolation mode.
Centroid behaves exactly as follows:
The term Conservative Rasterization has been used to describe, basically, a GPU rasterizer assist for shader-computed antialiasing. This concept had not actually been implemented in any known GPUs when this discussion was first written, but the following short discussion of Conservative Rasterization somewhat motivates the alternative that is specified here - Target Independent Rasterization. Note that as of D3D11.3, hardware has evolved to support Conservative Rasterization(15.17).
Consider how multisampling works in D3D (or GPU rasterization in general). Each pixel has “sample” positions which cause Pixel Shaders to be invoked when primitives (e.g. triangles) cover the samples. For multisampling, a single Pixel Shader invocation occurs when at least one sample in a pixel is covered. Alternatively, D3D10.1+ also allows the shader to request that the Pixel Shader be invoked for each covered sample – this has historically been called “supersampling”.
The downside to these antialiasing approaches is they are based on a discrete number of samples. The more samples the better, but there are still holes in the pixel area between the sample points in which geometry rendered there does not contribute to the image.
Conservative Rasterization, instead, would ideally invoke the Pixel Shader if the area of a primitive (e.g. triangle) being rendered has any chance of intersecting with the pixel’s square area. It would then be up to shader code to compute whatever measure of pixel area intersection it desires. It may be acceptable for the rasterization to be “conservative” in that triangles/primitives are simply rasterized with a fattened screen space area that could include some pixels with no actual coverage – it doesn’t really matter since the shader will be computing the actual coverage.
The win is that the number of Pixel Shader invocations is reasonably bounded to the triangle extents (as opposed to rendering bounding rectangles), and the output can be “perfect” antialiasing if desired. This is particularly the case if also utilizing some other features in D3D11 that allow arbitrary length lists to be recorded per pixel.
However, the complexity of the shader code required to compute an analytic coverage solution with Conservative Rasterization might be too high for the benefit. An alternative scheme, Target Independent Rasterization is defined here, under the more mundane heading 'Forcing Rasterizer Sample Count' below. First though, some discussion about how Target Independent Rasterization can help in at least one scenario - path rendering in Direct2D.
A common usage scenario of Direct2D is to stroke and/or fill anti-aliased paths. The semantics of the Direct2D anti-aliasing scheme are different from MSAA. The key difference is when the resolve step occurs. With MSAA the resolve step typically happens once per frame. With Direct2D anti-aliasing the resolve step occurs after each path is rendered. To work around these semantic differences the Windows 7 version of Direct2D performs rasterization on the CPU. When a path is to be filled or stroked, an expensive CPU-based algorithm computes the percentage of each pixel that is covered by the path. The GPU is used to multiply the path color by the coverage and blend the results with the existing render target contents. This approach is heavily CPU-bound.
Target Independent Rasterization enables Direct2D to move the rasterization step from the CPU to the GPU while still preserving the Direct2D anti-aliasing semantics. Rendering of anti-aliased paths will be performed in 2 passes on the GPU. The first pass will write per-pixel coverage to an intermediate render target texture. Paths will be tessellated into non-overlapping triangles. The GPU will be programmed to use Target Independent Rasterization and additive blending during the first pass. The pixel shader used in the first pass will simply count the number of bits set in the coverage mask and output the result normalized to [0.0,1.0]. During the second pass the GPU will read from the intermediate texture and write to the application’s render target. This pass will multiply the path color by the coverage computed during the first pass.
In some cases, it will be faster for Direct2D to tessellate paths into potentially overlapping triangles. In these cases, the 1st pass will set the ForcedSampleCount to 16 and simply output the coverage mask to the intermediate (R16_UINT). The blender would be setup to do a bitwise OR, or XOR operation (depending on the scenario). The second pass would read this 16-bit value from the intermediate, count the number of bits set, and modulate the color being written to the render target.
There are 2 fallbacks that could be used to implement this algorithm on GPUs that do not support Target Independent Rasterization. The first fallback would render the scene N times, with alpha = 1/N and additive blending for the first step of the algorithm. This would produce the same results, but at the cost of resorting to multipass rendering to mimic the effect of supersampling at the rasterizer. The second fallback would use MSAA to implement the first pass of the algorithm. Both fallbacks are bound by memory bandwidth (render target writes). Using Target Independent Rasterization would significantly reduce the memory bandwidth requirements of this algorithm.
Overriding the Rasterizer sample count means defining the multisample pattern at the Rasterizer independent of what RenderTargetViews(5.2) (or UnorderedAccessView(5.3.9)s) may be bound at the Output Merger (and their associated sample count / Quality Level).
The ForcedSampleCount state setting is located in the Rasterizer State(15.1) object.
UINT ForcedSampleCount; // Valid values for Target Independent Rasterization (TIR): 0, 1, 4, 8, 16
                        // Valid values for UAV(5.3.9)-only render: 0, 1, 4, 8, 16
                        // 0 means don't force sample count.
Devices must support all the standard sample patterns up to and including 16 for the ForcedSampleCount. This is even if the device does not support that many samples in RenderTarget / DepthStencil resources.
Investigations show that the 16 sample standard D3D pattern compares favorably with Direct2D's original software-based rasterization pattern, which had the significant disadvantage of using a regular grid layout, even though it was 64 samples.
With a forced sample count/pattern selected at the rasterizer (ForcedSampleCount > 0), pixels are candidates for shader invocation based on the selected sample pattern, independent of the RTV ("output") sample count. The burden is then on shader code to make sense of the possible mismatch between rasterizer and output storage sample count, given the defined semantics.
Here are the behaviors with ForcedSampleCount > 0.
The above functionality is required for Feature Level 11_1 hardware.
D3D10.0 - D3D11.0 hardware (and Feature Level 10_0 - 11_0) supports ForcedSampleCount set to 1 (and any sample count for RTV) along with the described limitations (e.g. no depth/stencil).
For 10_0, 10_1, and 11_0 hardware, when ForcedSampleCount is set to 1, line rendering cannot be configured to 2-triangle (quadrilateral) based mode (i.e. the MultisampleEnable state cannot be set to true). This limitation isn't present for 11_1 hardware. Note the naming of the 'MultisampleEnable' state is misleading since it no longer has anything to do with enabling multisampling; instead it is now one of the controls along with AntialiasedLineEnable for selecting line rendering mode.
This limited form of Target Independent Rasterization, ForcedSampleCount = 1, closely matches a mode that was present in D3D10.0 but due to API changes became unavailable for D3D10.1 and D3D11 (and Feature Levels 10_1 and 11_0). In D3D10.0 this mode was the center sampled rendering even on an MSAA surface that was available when MultisampleEnable was set to false (and this could be toggled by toggling MultisampleEnable). In D3D10.1+, MultisampleEnable no longer affects multisampling (despite the name) and only controls line rendering behavior. It turns out some software, such as Direct2D, depended on this mode to be able to render correctly on MSAA surfaces. As of D3D11.1, D2D can use ForcedSampleCount = 1 to bring back this mode consistently on all D3D10+ hardware and Feature Levels. D3D10.0 also supported depth testing in this mode as well, but it is not worth exposing that, given that D2D did not expose it, and the full D3D11.1 definition of the feature doesn't work with depth/stencil.
D3D11 allows rasterization with only UAVs bound, and no RTVs/DSVs. Even though UAVs can have any/different sizes, essentially, the viewport/scissor identify the pixel dimensions. Before this feature, when rendering with only UAVs bound, the rasterizer was limited to a single sample only.
UAV(5.3.9)-only rendering with multisampling at the rasterizer is possible by keying off the ForcedSampleCount state described earlier, with the sample counts limited to 0, 1, 4, 8 and 16. (The UAVs themselves are not multisampled in terms of allocation.) A setting of 0 is equivalent to the setting 1 - single sample rasterization.
Shaders can request pixel-frequency invocation with UAV-only rendering, but requesting sample-frequency invocation is invalid (produces undefined shading results).
The SampleMask Rasterizer State does not affect rasterization behavior at all here.
On D3D11.0 hardware, ForcedSampleCount can be 0, 1, 4 and 8 with UAV only Rasterization. D3D11.1 hardware additionally supports 16.
Attempting to render with unsupported ForcedSampleCount produces undefined rendering results - though if a ForcedSampleCount is chosen that could never be valid for TIR or UAV-only rendering the runtime will fail the Rasterizer State object creation immediately.
Pixel Shaders always run in minimum 2x2 quanta to be able to support derivative calculations, regardless of the RenderTarget sample count. These Pixel Shader derivative calculations, used in texture filtering operations, but also available directly in shaders, are calculated by taking deltas of data in adjacent pixels. This requires data in each pixel has been sampled with unit spacing horizontally or vertically.
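A simplified sketch of the quad mechanics (real hardware schedules and packs quads in its own way):

```cpp
// Within a 2x2 quad of Pixel Shader invocations, coarse derivatives are
// simple deltas between horizontally/vertically adjacent pixels.
struct Quad { float v[2][2]; };   // v[y][x]: one shaded value per pixel

float DdxCoarse(const Quad& q) { return q.v[0][1] - q.v[0][0]; }
float DdyCoarse(const Quad& q) { return q.v[1][0] - q.v[0][0]; }
```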
RenderTarget sample counts > 1 do not affect derivative calculation methods. If derivatives are requested on an attribute that has been Centroid sampled, the hardware calculation is not adjusted, and therefore incorrect derivatives will often result. What the Shader expects to be a derivative wrt a unit distance in the x or y direction in RenderTarget space will actually be the rate of change with respect to some other direction vector, which also probably isn't unit length.
The point here is that it is the application's responsibility to exhibit caution when requesting derivatives from Centroid sampled attributes, ideally never requesting them at all. Centroid sampling can be useful for situations where it is critical that a primitive's interpolated attributes are not "extrapolated", but this comes with some tradeoffs: First, centroid sampled attributes may appear to jump around as a primitive edge moves over a pixel, rather than changing continuously. Secondly, derivative calculations on the attributes become unreliable or difficult to use correctly (which also hurts texture sampling operations that derive LOD from derivatives).
Under sample-frequency execution, a 2x2 quad of Pixel Shaders executes for each sample index where that sample is covered in at least one of the pixels participating in the 2x2 quad. This allows derivatives to be calculated in the usual way since any given sample is located one unit apart horizontally or vertically from the corresponding sample in the neighboring pixels.
It is left to the application's shader author to decide how to adjust for the fact that derivatives calculated from spacings of one unit may need to be scaled in some way to reflect higher frequency shader execution, depending on the sample pattern/count.
Further important discussion of Pixel Shader derivatives is under Interaction of Varying Flow Control With Screen Derivatives(16.8).
Chapter Contents
4.1 Minimal Pipeline Configurations
4.2 Fixed Order of Pipeline Results
4.3 Shader Programs
4.4 The Element
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
The rendering Pipeline encapsulates all state related to the rendering of a primitive. This includes a sequence of pipeline stages as well as various state objects.
Section Contents
4.1.1 Overview
4.1.2 No Buffers at Input Assembler
4.1.3 IA + VS (+optionally GS) + No PS + Writes to Depth/Stencil Enabled
4.1.4 IA + VS (+optionally GS) + PS (incl. Rasterizer, Output Merger)
4.1.5 IA + VS + SO
4.1.6 No RenderTarget(s) and/or Depth/Stencil and/or Stream Output
4.1.7 IA + VS + HS + Tessellation + DS + ...
4.1.8 Compute alone
4.1.9 Minimal Shaders
Not all Pipeline Stages must be active. This section clarifies this concept by illustrating some minimal configurations that can produce useful results. The Graphics pipeline is accessed by Draw* calls from the API. The alternative pipeline, Compute, is accessed by issuing Dispatch* calls from the API.
For the Graphics pipeline, the Input Assembler is always active, as it produces pipeline work items. In addition, the Vertex Shader is always active. Relying on the presence of the Vertex Shader at all times simplifies data flow permutations very significantly, versus allowing the Input Assembler with its limited programming flexibility to feed any pipeline stage.
Note that even though the Vertex Shader must always be active in the Graphics pipeline, in scenarios where applications really don't want to have a Vertex Shader, and must simply implement it as a trivial or nearly trivial sequence of mov's from inputs to outputs, the short length and simplicity of such "passthrough" shaders should not be a problem for hardware implementations to practically hide the cost of, one way or another.
A minimal use of the Input Assembler is to not have any input Buffers bound (vertex or index data). The Input Assembler can generate counters such as VertexID(8.16), InstanceID(8.18) and PrimitiveID(8.17), which can identify vertices/primitives generated in the pipeline by Draw*(), or DrawIndexed*() (if at least an Index Buffer is bound). Thus Shaders can minimally drive all their processing based on the IDs if desired, including fetching appropriate data from Buffers or Textures, as sketched below.
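A minimal sketch at the API level (C++), assuming a Vertex Shader is already bound that derives all of its per-vertex data from its VertexID input:

#include <d3d11.h>

// Draw three vertices with no vertex or index Buffers bound; the Shader
// fetches or derives everything it needs from the generated VertexID.
void DrawBufferless(ID3D11DeviceContext* ctx)
{
    ctx->IASetInputLayout(nullptr); // no Input Assembler Elements declared
    ctx->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    ctx->Draw(3, 0);                // VertexID takes the values 0, 1, 2
}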
If the shader stage before the rasterizer outputs position, and Depth/Stencil writes are enabled, the rasterizer will simply perform the fixed-function depth/stencil tests and updates to the Depth/Stencil buffer, even if there is no Pixel Shader active. No Pixel Shader means no updates to RenderTargets other than Depth/Stencil.
The Input Assembler + Vertex Shader (required) can drive the Pixel Shader directly (the GS does not have to be used, but can be). If an application seeks to write data to RenderTarget(s), not including Depth/Stencil which was explained earlier, the Pixel Shader must be active. This implicitly involves the Output Merger as well, though as described further below, there is no requirement that RenderTargets be bound just because rasterization is occurring.
The Input Assembler (+required VS) can feed Stream Output directly with no other stages active. Note that as described in the Stream Output Stage(14) section, Stream Output is tied to the Geometry Shader, however a "NULL" Geometry Shader can be specified, allowing the outputs of the Vertex Shader to be sent to Stream Output with no other stages active.
Whether or not the Pixel Shader is active, it is always legal to NOT have any output targets bound (and/or have output masks defined so that no output targets are written). Likewise for Stream Output. This might be interesting for performance tests which don't include output memory bandwidth (and which might examine feedback statistics such as shader invocation counts, which is itself a form of pipeline output anyway).
Take any of the configurations above, and HS + Tessellator + DS can be inserted after the VS. The presence of the DS is what implies the presence of the Tessellator before it.
When the Compute Shader runs, it runs by itself. The state for both the Graphics pipeline shaders and the Compute Shader can be simultaneously bound; the API call selects which pipeline executes: Draw* invokes Graphics and Dispatch* invokes Compute, as illustrated below.
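For illustration, a sketch (C++) assuming already-created shader objects:

#include <d3d11.h>

// Graphics and Compute state coexist on the Device Context; the call
// made selects which pipeline executes.
void RunBothPipelines(ID3D11DeviceContext* ctx,
                      ID3D11VertexShader* vs, ID3D11PixelShader* ps,
                      ID3D11ComputeShader* cs)
{
    ctx->VSSetShader(vs, nullptr, 0);
    ctx->PSSetShader(ps, nullptr, 0);
    ctx->CSSetShader(cs, nullptr, 0); // Compute state bound alongside Graphics state
    ctx->Draw(3, 0);                  // invokes the Graphics pipeline
    ctx->Dispatch(8, 8, 1);           // invokes the Compute pipeline
}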
All vertex shaders must have a minimum of one input and one output, which can be as little as one scalar value. Note that System Generated Values such as VertexID(8.16) and InstanceID(8.18) count as input.
The rendering Pipeline is designed to allow hardware to execute tasks at various stages in parallel. However observable rendering results must match results produced by serial processing of tasks. Whenever a task in the Pipeline could be performed either serially or in parallel, the results produced by the Pipeline must match serial operation. That is, the order that tasks enter the Pipeline is the order that tasks are observed to be propagated all the way through to completion. If a task moving through the Pipeline generates additional sub-tasks, those sub-tasks are completed as part of completing the spawning task, before any subsequent tasks are completed. Note that this does not prevent hardware from executing tasks out of order or in parallel if desirable, just as long as results are buffered appropriately such that externally visible results reflect serial execution.
One exception to this fixed ordering is with Tessellation. With the fixed function Tessellation stage, implementations are free to generate points and topology in any order as long as that order is consistent given the same input on the same device. Vertices can even be generated multiple times in the course of tessellating a patch, as long as the Tessellator output topology is not point (in which case only the unique points in the patch must be generated). This tessellator exception is discussed here(11.7.9).
Another exception to the fixed ordering of pipeline results is any access to an Unordered Access View of a Resource (for example via the Compute Shader or Pixel Shader). These types of Views explicitly allow unordered results, leaving the burden on applications to make careful choices of atomic instructions to access Unordered Access Views if deterministic and implementation-invariant output is desired.
A Shader object encapsulates a Shader program for any type of Shader unit. All shaders have a common binary format and basically have the following typical layout. A helpful reference for this is the source code accompanying the Reference Rasterizer, which includes facilities for parsing the shader binary.
The Tessellation related shaders have a significantly different structure, particularly the Hull Shader, which appears as multiple phases of shaders concatenated together (not depicted here).
version
input declarations
output declarations
resource declarations
code

version describes the Shader type: Vertex Shader (vs), Hull Shader (hs), Domain Shader (ds), Geometry Shader (gs), Pixel Shader (ps), Compute Shader (cs). Example: vs_5_0, ps_5_0

input declarations declare which input registers are read. Example:
dcl_input v[0]
dcl_input v[1].xy
dcl_input v[2]

output declarations declare which output registers are written. Example:
dcl_output o[0].xyz
dcl_output o[1]
dcl_output o[2].xw

resource declarations. Example:
dcl_resource t0, Buffer, UNORM
dcl_resource t2, Texture2DArray, FLOAT

code: this Shader section contains the executable instructions.
Section Contents
(back to chapter)
4.4.1 Overview
4.4.2 Elements in the Pipeline
4.4.3 Passing Elements Through Pipeline Interfaces
From the perspective of individual D3D11.3 Pipeline stages accessing and interpreting memory, all memory layouts (e.g. Buffer, Texture1D/2D/3D/Cube) are viewed as being composed of "Elements". An individual Element represents a vector of anywhere from 1 to 4 values. An Element could be an R8G8B8A8 packing of data, a single 8-bit integer value, 4 float32 values, etc. In particular, an Element is any one of the DXGI_FORMAT_* formats(19.1), e.g. DXGI_FORMAT_R8G8B8A8 (DXGI stands for "DirectX Graphics Infrastructure", a software component outside the scope of this specification which happens to own the list of DirectX formats going forward). Filtering may be involved in the process of fetching an Element from a texture, and this simply involves looking at multiple values for a given Element in memory and blending them in some fashion to produce an Element that is returned to the Shader.
Buffers in memory can be made up of structures of Elements (as opposed to being a collection of a single Element). For example a Buffer could represent an array of vertices, each vertex containing several elements, such as: position, normal and texture coordinates. See the Resources(5) section for full detail.
The concept of "Elements" does not only apply to resources. Elements also characterize data passing from one Pipeline stage to the next. For example the outputs of a Vertex Shader (Elements making up a vertex) are typically read into a subsequent Pipeline stage as input data, for instance into a Geometry Shader. In this scenario, the Vertex Shader writes values to output registers, each of which represents an individual Element. The subsequent Shader (Geometry Shader in this example) would see a set of input registers each initialized with an Element out of the set of input data.
There are various types of data interfaces in the hardware Pipeline through which Elements pass. This section describes the interfaces in generic terms, and characterizes how Elements of data pass through them. Specific descriptions for each of the actual interfaces in the Pipeline are provided throughout the spec, in a manner consistent with the principles outlined here. The overall theme here is that data mappings through all interfaces are always direct, without any linkage resolving required.
The first type of interface is Memory-to-Stage, where an Element from a Resource (Texture/Buffer) is being fetched into some part of the Pipeline, possibly the "top" of the Pipeline (Input Assembler(8)), or the "side", meaning a fetch driven from within a Shader Stage. At the point of binding of memory Resources to these interfaces, a number is given to each Element that is bound, representing which input (v#) or texture (t#) "register" at the particular interface refers to the Element. Note that there is no linkage resolving done on behalf of the application; the Shader assumes which "registers" will refer to particular Elements in memory, and so when memory is bound to the interface, it must be bound (or declared, in cases where multiple Elements come from the same Resource in memory) at the "register" expected by the Shader.
For Memory-to-Stage interfaces, Elements always provide to the Shader 4 components of data, with defaults provided for Elements in memory containing fewer than 4 components (though this can be masked to be any subset of the 4 components in the Shader if desired).
For interfaces on the "side", where memory Resources are bound to Shader Stages so they can be fetched from via Shader code, the set of binding points (t# registers in the Shader) cannot be dynamically indexed within the Shader program without using flow control.
On the other hand, the interface at the "top" of the Pipeline (the input v# registers of the first active Shader Stage) can be dynamically indexed as an array from Shader code. The Elements in v# registers being indexed must have a declaration(22.3.30) specifying each range that is to be indexed, where each range specifies a contiguous set of Elements/v# registers, ranges do not overlap, and the components declared for each Element in a given range are identical across the range.
The second type of interface is Stage-to-Stage, where one Pipeline Stage outputs a set of 4 component Elements (written to output o# registers) to the subsequent active Pipeline Stage, which receives Elements in its input v# registers. The mapping of output registers in one Stage to input registers in the next Stage is always direct; so a value written to o3 always goes to v3 in the subsequent Stage. Any subset of the 4 components of any Element can be declared rather than the whole thing.
If more Elements or components within Elements are output than are expected/declared for input by the subsequent Stage, the extra data is discarded. If fewer Elements or components within Elements are output than are expected/declared for input by the subsequent Stage, the missing data is undefined.
Similar to the Memory-to-Stage interface at the "top" of the Pipeline, which feeds the input v# registers of the first active Pipeline Stage, at a Stage-to-Stage interface, writes to output Elements (o#) and at the subsequent Stage, reads from input elements (v#) can each be dynamically indexed as arrays from code at the respective Shaders. The Elements in o# registers being indexed must have a declaration(22.3.30) for each range, specifying a contiguous set of Elements/o# registers, without overlapping, and with the same component masks declared for each Element in a given range. The same applies to input v# registers at the subsequent stage (the array declarations for the input v# registers in the Shader are independent/orthogonal to the array declarations for o# in the previous Shader).
There is a detail which is mostly orthogonal to the Stage-to-Stage interface discussion above: the frequency of operation varies at subsequent Stages, in addition to the amount of data different Stages can input. For example the Geometry Shader(13) inputs all the vertices for a primitive. The Pixel Shader(16) can choose to have its inputs interpolated from vertices, or take the data from a single vertex. The point of the above discussion is only to describe the mechanism for Element transport through the interfaces independently of these varying frequencies of operation between Stages.
The final type of interface is Stage-to-Memory, where a Pipeline Stage outputs a set of 4 component Elements (written to output o# registers) on a path out to memory. These interfaces (e.g. to RenderTargets or Stream Output) are somewhat the converse of the Memory-to-Stage Interface. Each memory Resource representing one or more Elements of output identifies each Element by a number #, corresponding directly to an output o# register. There is no linkage resolving done on behalf of the application; the application must associate target memory for Element output directly with each o# register that will provide it. Details on specifying these associations are unique for the different Stage-to-Memory interfaces (RenderTargets, Stream Output).
If a Stage-to-Memory interface outputs more Elements or components within Elements than there are destination memory bindings to accommodate, the extra data is discarded. If a Stage-to-Memory interface outputs fewer Elements or components within Elements than there are destination memory bindings expecting to be written, undefined data will be output (i.e. no defaults). At RenderTarget output, there are various means to mask what data gets output, most interesting of which is depth testing, but that is outside the scope of this discussion.
At the RenderTarget output interface (which is Pixel Shader(16) output), dynamic indexing of the o# registers is not supported. For the other Stage-to-Memory interface, Stream Output, indexing of outputs is permissible. Stream Output shares the output o# registers used for Stage-to-Stage output in the Geometry Shader(13) Stage, where indexing is permitted as defined for the Stage-to-Stage interface.
There are various hardware generated values which can each be made available for input to certain Shader Stages by declaring them for input to a component of an input register. A listing of each System Generated Value in D3D11.3 can be found in the System Generated Value Reference(23), but in addition, here are links to descriptions of some (not all) of the System Generated Values: VertexID(8.16), InstanceID(8.18), PrimitiveID(8.17), IsFrontFace(15.12).
In the Hull Shader(10), Domain Shader(12) and Geometry Shader(13), PrimitiveID is a special case that has its own input register, but for all other cases of inputting hardware generated values into Shaders (including the PrimitiveID into the Pixel Shader(16)), the Shader must declare a scalar component of one of its input v# registers as one of the System Generated Values to receive each input value. If that v# register also has some components provided by the previous Stage or Input Assembler(8), the hardware generated value can only be placed in one of the components after the rest of the data. For example if the Input Assembler provides v0.xz, then VertexID might be declared for v0.w (since w is after z), but not v0.y. There cannot be overlap between the target for generated values and the target for values arriving from an upstream Stage or the Input Assembler.
Hardware generated values that are input into the generic v# registers can only be input into the first active Pipeline Stage in a given Pipeline configuration that understands the particular value; from that point on it is the responsibility of the Shader to manually pass the values down if desired through output o# registers. If multiple Stages in the pipeline request a hardware generated value, only the first stage receives it, and at the subsequent stages, the declaration is ignored (though a prudent Shader programmer would pass down the value manually to correspond with the naming).
Since VertexID(8.16) and InstanceID(8.18) are both meaningful at a vertex level, and IDs generated by hardware can only be fed into the first stage that understands them, these ID values can only be fed into the Vertex Shader. PrimitiveID(8.17) generated by hardware can only be fed into the Hull Shader, Domain Shader, as well as whichever of the following is the first remaining active stage: Geometry Shader or Pixel Shader.
It is not legal to declare a range of input registers as indexable(22.3.30) if any of the registers in the range contains a System Generated Value.
From the API point of view, System Generated Values and System Interpreted Values (below) may be exposed to developers as just one concept: "System Values" ("SV_*").
In many cases, hardware must be informed of the meaning of some of the application-provided or computed data moving through the D3D11.3 Pipeline, so the hardware may perform a fixed function operation using the data. The most obvious example is "position", which is interpreted by the Rasterizer (just before the Pixel Shader). Data flowing through the D3D11.3 Pipeline must be identified as a System Interpreted Value at the output interface between Stages where the hardware is expected to make use of the data. For the case where the Input Assembler(8) is the only Stage present in a Pipeline configuration before the place where the hardware is expected to interpret some data, the Input Assembler(8) has a mechanism for identifying System Interpreted Values to the relevant (components of) Elements it declares.
A listing of each System Interpreted Value in D3D11.3 can be found in the System Interpreted Values Reference(24). Each System Interpreted Value has typically one place in the Pipeline where it is meaningful to the hardware. Also, there may be constraints on how many components in an Element need to be present (such as .xyzw for "position" going to the Rasterizer).
If data produced by the Input Assembler or by the output o# registers of any Stage is identified as a System Interpreted Value at a point in the pipeline where the hardware has no use for interpreting the data, the label is silently ignored (and the data simply flows to the next active Stage uninterpreted). For example if the Input Assembler labels the xyzw components of one of the Elements it is producing as "position", but the first active Pipeline Stage is the Vertex Shader, the hardware ignores the label, since there is nothing for hardware to do with a "position" going into the Vertex Shader.
Just because data is tagged as a System Interpreted Value, telling hardware what to do with it, does not mean the hardware necessarily "consumes" the data. Any data flowing through the Pipeline (System Interpreted Value or not) can typically be input into the next Pipeline Stage's Shader regardless of whether the hardware did something with the data in between. In other words, output data identified as a System Interpreted Value is available to the subsequent Shader Stage if it chooses to input the data, no differently from non-System Interpreted Values. If there are exceptions, they would be described in the System Interpreted Value Reference(24). One catch is that if a given Pipeline Stage, or the Input Assembler, identifies a System Interpreted Value (e.g. "clipDistance"), and the next Shader Stage declares it wants to input that value, it must not only declare as input the appropriate register # and component(s), but also identify the input as the same System Interpreted Value (e.g. "clipDistance"). Mismatching declarations results in undefined behavior. e.g. Identifying an output o3.x as "clipDistance", but not naming a declared input at the next stage v3.x as "clipDistance" is bad. Of course, in this example it would be legal for the subsequent Shader to not declare v3.x for input at all.
It is not legal to declare a range of input or output registers as indexable(22.3.30) if any of the registers in the range contains a System Interpreted Value, with the exception of System Interpreted Values for the Tessellator, which have their own indexing rules - see the Hull Shader(10) specification.
Note that there is no mechanism in the hardware to identify things that the hardware does not care about, such as "texture coordinate" or "color". At a high level in the software stack, full naming of all data may or may not be present to assist in authoring and/or discoverability, but these issues are outside the scope of anything that hardware or drivers need to know about.
Note that while it may seem redundant to label System Interpreted Values at both the place producing the values as well as the next stage inputting it (in the case where the next stage actually wants to input it), this helps hardware/drivers isolate the compilation step for Shader programs at different Stages from any dependency on each other, in the event the driver needs to rename registers to fit hardware optimally, in a way that is transparent to the application.
From the API point of view, System Generated Values (above) and System Interpreted Values may be exposed to developers as just one concept: "System Values" ("SV_*").
In many cases in D3D11.3, an offset for an Element is required, a stride for a structure (e.g. vertex) is required, or an initial offset for a Buffer is required. All of these types of values have the following alignment restrictions:
Example byte alignments for some of the formats(19.1) which can be used in structures (e.g. vertex buffers) or as elements in index buffers:
However, these alignment rules do not apply to Buffer offsets when creating Views on Buffers. These Buffer offsets have more stringent requirements, detailed in the View section(5.2).
There is also some similar discussion, focused on memory accesses common to UAVs(5.3.9), SRVs and Thread Group Shared Memory in the Memory Addressing and Alignment Issues(7.13) section.
None of these rules are validated (except in debug mode) and violations will result in undefined behavior.
Chapter Contents
(back to top)
5.1 Memory Structure
5.2 Resource Views
5.3 Resource Types and Pipeline Bindings
5.4 Resource Creation
5.5 Resource Dimensions
5.6 Resource Manipulation
5.7 Resource Discard
5.8 Per-Resource Mipmap Clamping
5.9 Tiled Resources
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
Several different Resource Types (arrangements of memory storage) are available for input or output by various Pipeline stages. The available Resource Types are: Buffer(5.3.4) (Typically a Structured(5.1.3) or "Unstructured(5.1.2) region of memory), Texture1D(5.3.5) (Homogeneous array of 1D Textures), Texture2D(5.3.6) (Homogeneous array of 2D Textures), Texture3D(5.3.7) (Volume Texture), and TextureCube(5.3.8) (3D enclosure). The Resource Type, in general, determines many characteristics, like whether the memory is Structured(5.1.3), where the Resource may be bound to in the graphics pipeline, how many mip levels there are, what the sampling behavior is, and other possible restrictions/properties on the Resource. Resources are built up of one or more Subresources, which each are a generalized 3D quantity of data which degenerates to store 2D and 1D quantities of data. The arrangement of Subresources to build up a Resource is tied to the Resource Type and dimensions.
There are also distinctions in how a Resource is bound to the graphics pipeline. The binding location can also be thought of as accepting either Buffers directly or accepting Views of Resources. Each binding location which accepts Views requires a unique View type for that location - e.g. Render Target View or Shader Resource View.
The sizes for mipmap slice subresources 1..n are computed sequentially from the size of the largest subresource (subresource 0), where for each mipped dimension:
mipslice N+1 size = max(1, floor(mipslice N size / 2))
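The rule can be sketched in code as follows (a C++ illustration; the max(1, ...) clamp keeps each mipped dimension from shrinking below one):

#include <algorithm>
#include <cstdio>

// Print the dimensions of each mip slice given the top-level dimensions.
void PrintMipChain(unsigned width, unsigned height, unsigned depth,
                   unsigned mipLevels)
{
    for (unsigned mip = 0; mip < mipLevels; ++mip)
    {
        std::printf("mip %u: %ux%ux%u\n", mip, width, height, depth);
        width  = std::max(1u, width  / 2); // unsigned division == floor
        height = std::max(1u, height / 2);
        depth  = std::max(1u, depth  / 2);
    }
}
// For a 4x3x5 Texture3D (an example used later in this chapter) this yields
// mip 0: 4x3x5, mip 1: 2x1x2, mip 2: 1x1x1 - three mip levels in total.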
The following diagram depicts Resources, their Subresource arrangement, and how they are sampled from within shaders. While the following diagram depicts deep mip mapping, it is valid to create Resources less than the maximum amount of mip levels.
Section Contents
(back to chapter)
5.1.1 Overview
5.1.2 Unstructured Memory
5.1.3 Structured Buffers
5.1.4 Raw Buffers
5.1.5 Prestructured+Typeless Memory
5.1.6 Prestructured+Typed Memory
When a Resource is allocated, its memory structure can generally be classified as Unstructured, Prestructured+Typeless, or Prestructured+Typed.
Only the Buffer Resource(5.3.4) construction may be created as "Unstructured". Unstructured identifies the Resource as a single contiguous block of memory with no mipmaps, nor array slices. Unstructured Resources generally must have the memory structure defined when the Resource is bound to the graphics pipeline (providing types and offsets for the Element(s) in the Resource, as well as an overall stride). This memory structure can change freely, since it is late-bound to the Resource at the graphics pipeline binding location.
The same Unstructured Resource may be bound to multiple slots in the graphics Pipeline with different memory interpretations at each location, as long as the Resource is only being read from at each binding. The same Unstructured Resource may not be bound to read and write stages of the pipeline simultaneously for a single Draw/Dispatch operation.
Unstructured Resources do not have mipmaps nor array slices. See the Resource Binding Table(5.3.1) for descriptions of where Buffers (the only Resources that can be Unstructured) can be bound in the Pipeline.
Only the Buffer Resource(5.3.4) construction may be created as "Structured". Structured identifies the Resource as a single contiguous block of memory with no mipmaps, nor array slices, but it does have a structure size (stride), so that it represents an array of structures. Implementations can take advantage of knowing there is a fixed structure size in the way they lay out the memory physically (hidden from the application).
A number of application scenarios require the ability to write a structure of data out to an index in an array, e.g. generating an unordered collection of output data in an Append buffer(5.3.10). Hardware may be optimized for reads and writes that are smaller than the stride of a structure. Consider a group of 16 shader threads where each thread wants to write out the first 4 bytes of a structure. If the structure is only 4 bytes, the 16 threads will collectively write out 16 consecutive 32-bit locations, which tends to be fast. But if the structure is larger - say 64 bytes - then the 16 threads will each issue a write that is spaced 64 bytes apart. When reading the data back in a later pass, the same problem recurs: reads will be issued with a spacing equal to the stride of the structure, with larger structures likely to have more of a performance issue.
Since the reads and the writes have similar access patterns, it is better for the data layout in memory to match the access pattern that actually occurs. Because the actual access pattern, as well as the performance characteristics of reads spaced by stride boundaries, is hardware specific, the design pattern of textures is followed: the physical layout of the memory is hidden from the application, allowing for better performance.
The same Structured Resource may be bound to multiple slots in the graphics Pipeline, as long as the Resource is only being read from at each binding. The same Structured Resource may not be bound to read and write stages of the pipeline simultaneously for a single Draw/Dispatch operation.
Structured Resources do not have mipmaps nor array slices. See the Resource Binding Table(5.3.1) for descriptions of where Buffers (the only Resources that can be Structured) can be bound in the Pipeline.
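For illustration, a sketch of creating a Structured Buffer at the API level. Identifier names follow the public D3D11 headers (where the flag described in this spec appears as D3D11_RESOURCE_MISC_BUFFER_STRUCTURED); the Particle structure is hypothetical:

#include <d3d11.h>

struct Particle { float pos[3]; float age; }; // hypothetical 16-byte structure

HRESULT CreateStructuredBuffer(ID3D11Device* dev, ID3D11Buffer** out)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth           = UINT(1024 * sizeof(Particle));
    desc.Usage               = D3D11_USAGE_DEFAULT;
    desc.BindFlags           = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
    desc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(Particle); // fixed stride; physical layout stays hidden
    return dev->CreateBuffer(&desc, nullptr, out);
}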
Sometimes a convenient way to access the contents of a Buffer is to treat it simply as a huge bag of bits. The Raw view comes close to this, by allowing access to a Buffer in the form of 32-bit aligned addressing and accessing of data in chunks of 1-4 32-bit values, with no type.
Raw access to a Buffer is indicated when creating either a Shader Resource View(5.2) (SRV) or Unordered Access View(5.3.9) (UAV), with the flag D3D11_BUFFER_SRV_FLAG_RAW (SRV) or D3D11_BUFFER_UAV_FLAG_RAW (UAV).
To be able to create a RAW View, the underlying resource must have been created with D3D11_RESOURCE_MISC_ALLOW_RAW_VIEWS.
This flag cannot be combined with D3D11_RESOURCE_MISC_STRUCTURED_BUFFER. Also, a Buffer created with D3D11_BIND_CONSTANT_BUFFER cannot also specify D3D11_RESOURCE_MISC_ALLOW_RAW_VIEWS. This is not a limitation, since Constant Buffers already have a constraint that they cannot be accessed with any other View in the first place.
Other than those invalid cases, specifying D3D11_RESOURCE_MISC_ALLOW_RAW_VIEWS when creating a Buffer does not limit any functionality versus not having it – e.g. the Buffer can be used for non-RAW access in any number of ways possible with D3D. Specifying the D3D11_RESOURCE_MISC_ALLOW_RAW_VIEWS flag only increases available functionality – it is just giving the system an early indication that the Buffer may participate in RAW style access in addition to other uses.
Any Resource type may be created as "Prestructured+Typeless". A structure size is provided, plus bit widths of components (but not the types of those components), and also dimensions (in units of structures) appropriate for the Resource type. This is unlike a Structured Buffer, which only specifies a structure size/stride and no definition of the contents of the structure. Before the Resource is bound to the pipeline, Resource Views must be created which fully qualify the components' types. These Resource Views also allow the Resource to be decomposed into smaller compatible subgroupings of the Subresources. For example, a fully mipped DXGI_FORMAT_R32G32B32A32_TYPELESS Texture3D with a width of four, a height of three, and a depth of five would have three mip levels. To use this texture, a Resource View would have to fully qualify the format of the Resource, possibly to DXGI_FORMAT_R32G32B32A32_UINT. In addition, the Resource View could also regroup only the two least detailed mip levels or select only a particular mip level. This allows the original Resource to be manipulated as if it were a Resource made up of only a few Subresources within the original Resource. The full details of Resource Views(5.2) are described later.
The benefit of Prestructured+Typeless Resources is that memory may be used as weakly typed storage, enabling limited reuse or reinterpretation of the memory, as long as the component bit counts remain the same. The same Prestructured+Typeless Resource may be bound to multiple slots in the graphics pipeline with Views of different fully qualified formats at each location. This forces bit representations of formats to be well-defined with respect to each other.
For example, a Resource created with the format R32G32B32A32_TYPELESS may be used as R32G32B32A32_FLOAT and R32G32B32A32_UINT at different locations in the pipeline simultaneously.
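A sketch of this scenario at the API level, using the 4x3x5 Texture3D example from above (names follow the public D3D11 headers):

#include <d3d11.h>

// Create a fully mipped R32G32B32A32_TYPELESS Texture3D (three mip levels:
// 4x3x5, 2x1x2, 1x1x1), then a UINT View regrouping only the two least
// detailed mip levels.
HRESULT CreateTypelessTex3DWithView(ID3D11Device* dev,
                                    ID3D11Texture3D** tex,
                                    ID3D11ShaderResourceView** srv)
{
    D3D11_TEXTURE3D_DESC td = {};
    td.Width = 4; td.Height = 3; td.Depth = 5;
    td.MipLevels = 3;
    td.Format = DXGI_FORMAT_R32G32B32A32_TYPELESS; // component types left open
    td.Usage = D3D11_USAGE_DEFAULT;
    td.BindFlags = D3D11_BIND_SHADER_RESOURCE;
    HRESULT hr = dev->CreateTexture3D(&td, nullptr, tex);
    if (FAILED(hr)) return hr;

    D3D11_SHADER_RESOURCE_VIEW_DESC sd = {};
    sd.Format = DXGI_FORMAT_R32G32B32A32_UINT;     // View fully qualifies the type
    sd.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE3D;
    sd.Texture3D.MostDetailedMip = 1;              // select the two smallest mips
    sd.Texture3D.MipLevels = 2;
    return dev->CreateShaderResourceView(*tex, &sd, srv);
}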
Any Resource type may be created as "Prestructured+Typed", also known as creating the Resource with a fully-qualified type or format. In general, this may allow Resource optimizations, especially when the Resource is created with flags indicating that the Resource cannot be Mapped/ Locked by the application.
Special resource formats, such as Block Compression Formats(19.5), have the characteristic that in order to read an individual Element in the resource, there is not a unique location in the resource that corresponds to the Element. Some sort of decompression or decoding of data from locations in the resource that are not unique to a particular Element is required during the read process in order to resolve what an individual Element is (even when no filtering is being applied). Complex formats like this must be created as part of a "Prestructured+Typed" resource.
"Prestructured+Typed" and "Prestructured+Typeless" resources support mipmapping, as the combination of Resource type, dimensions and structure size provided during resource creation supply enough information to allocate all memory in the layout required. Additionally, Resource Views created against Prestructured+Typed Resources must have indentical Resource Formats as the Prestructured+Typed Resource.
Section Contents
(back to chapter)
5.2.1 Overview
5.2.2 Shader Resource View Support for Raw and Structured Buffers
5.2.3 Clearing Views
In order to indirectly bind a Resource to certain stages of the graphics pipeline, Resource Views must be used. In addition, since some Resources may be created as "Prestructured+Typeless", the View provides the final opportunity to fully qualify the Resource components' types. The Resource Views also allow the Resource to be decomposed into smaller compatible subgroupings of the Mip Slices, Array Slices, and Subresources. This means that the effective dimensions and array sizes of the Views will, naturally, always be less than or equal to those of the original Resource. Each stage of the pipeline requires a unique type of View, and each type of View may have its own custom set of state parameters that are needed to complete the process of binding a particular Resource to the graphics pipeline stage. All necessary restrictions on the basic Resource have already been applied through the Pipeline Bind Flags during Resource creation. These Resource Views are directly bound to the pipeline, instead of the Resource objects themselves.
A resource view is distinct from the underlying resource from which the view was created, so where views are used, the view properties (number of mipmaps, number of array elements, type, etc.) are always used in place of the properties of the original resource. Thus, for example, a render target array index of zero always indicates the first array element in the view, even if the first array element in the view is not the first array element in the underlying resource. Out of range behaviors are also always with respect to the view properties where views are used.
Each unique View type has certain restrictions associated with the bind location of the graphics pipeline stage. For example, Render Target Views of Buffers may have a maximum width of 16384. This maximum is smaller than the maximum size of a Buffer (min(max(128,0.25f * (Amount of Dedicated VRAM)),2048) MB), so only a subsection of large Buffers may be bound as a Render Target at a time. In addition, Render Target Views of Texture3D may have a maximum array size of 2048. This fortunately matches the maximum W dimension size of a Texture3D (2048).
When Views are created of Buffers, restrictions are placed on the View's starting offset in the Buffer. If represented as a byte offset, the offset must be a multiple of the View Element Size. Another way to comply with this restriction is by specifying the Buffer offset in an integral number of View Elements. In addition, there exists another restriction on Buffer View creation. Views of the R32G32B32 element type cannot be created on a Buffer which had the Pipeline Bind flag of IAVERTEXINPUT, IAINDEXINPUT, CONSTANTBUFFER, or STREAMOUTPUT set. This prevents an R32G32B32 element from being used simultaneously as vertex and texture data.
To characterize the kind of decomposition that Shader Resource Views are capable of, here's a complete listing of the number of Views that are possible with a Texture2D Resource that was created fully mipped with the most detailed LOD: width = 4, height = 4, arraysize = 3.
The Views bound at the Render Target, Depth Stencil and Unordered Access binding locations in the pipeline have further restrictions, in that they can only choose a Mip Slice, i.e. select only one mip level. Here's a listing of the possible decomposition that can occur with Render Target, Depth Stencil and Unordered Access Views of the same Resource used in the previous example:
The following DDIs indicate the way Shader Resource Views (SRVs) are created, allowing read-only access to Raw and Structured Buffers in any shader stage.
Making an SRV of a Raw buffer allows it to be declared for read in any shader stage by the ld_raw instruction. This is accomplished by specifying a flag on creation of the Buffer View requesting Raw access (D3D11_DDI_BUFFEREX_SRV_FLAG_RAW) shown below.
In contrast, if the underlying Buffer was created as a Structured Buffer, then any SRV of the Buffer inherits the Structured semantics. In this case all shader stages can declare the resource for read by the ld_structured instruction. Note that unlike _RAW views (where the View decides that the Buffer will be "viewed" as RAW), nothing about the creation of a View of a Structured Buffer needs to indicate that it is structured, because once the Structured property is assigned to a Buffer on creation of the resource (including a structure stride), all Views on the Buffer are automatically Structured.
typedef struct D3D11DDIARG_BUFFER_SHADERRESOURCEVIEW
{
    UINT FirstElement;
    UINT NumElements;
} D3D11DDIARG_BUFFER_SHADERRESOURCEVIEW;

// BufferEx - Ex means extra parameters
typedef struct D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW
{
    UINT FirstElement;
    UINT NumElements;
    UINT Flags; // See D3D11_DDI_BUFFEREX_SRV_FLAG* below
} D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW;

#define D3D11_DDI_BUFFEREX_SRV_FLAG_RAW 0x00000001

typedef struct D3D11DDIARG_CREATESHADERRESOURCEVIEW
{
    D3D11DDI_HRESOURCE hDrvResource;
    DXGI_FORMAT Format;
    D3D11DDIRESOURCE_TYPE ResourceDimension;
    union
    {
        D3D11DDIARG_BUFFER_SHADERRESOURCEVIEW Buffer;
        D3D11DDIARG_TEX1D_SHADERRESOURCEVIEW Tex1D;
        D3D11DDIARG_TEX2D_SHADERRESOURCEVIEW Tex2D;
        D3D11DDIARG_TEX3D_SHADERRESOURCEVIEW Tex3D;
        D3D11DDIARG_TEXCUBE_SHADERRESOURCEVIEW TexCube;
        D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW BufferEx;
    };
} D3D11DDIARG_CREATESHADERRESOURCEVIEW;
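For reference, a sketch of the equivalent Raw View creation at the API level (names follow the public D3D11 headers, where the flag appears as D3D11_BUFFEREX_SRV_FLAG_RAW):

#include <d3d11.h>

// Create a Raw SRV covering an entire Buffer; the Buffer must have been
// created with the ALLOW_RAW_VIEWS misc flag described above.
HRESULT CreateRawSRV(ID3D11Device* dev, ID3D11Buffer* buf, UINT byteWidth,
                     ID3D11ShaderResourceView** srv)
{
    D3D11_SHADER_RESOURCE_VIEW_DESC sd = {};
    sd.Format = DXGI_FORMAT_R32_TYPELESS;        // required format for Raw views
    sd.ViewDimension = D3D11_SRV_DIMENSION_BUFFEREX;
    sd.BufferEx.FirstElement = 0;
    sd.BufferEx.NumElements  = byteWidth / 4;    // Raw elements are 32-bit
    sd.BufferEx.Flags        = D3D11_BUFFEREX_SRV_FLAG_RAW;
    return dev->CreateShaderResourceView(buf, &sd, srv);
}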
Clearing is an optimized operation to enable filling Render Target, Depth Stencil and Unordered Access Views with certain clear values.
The floating point values passed in through the DDI must be converted to the fully qualified format type of the View being cleared. The standard type conversion rules(3.2) indicate how to convert most values, but these conversion rules do not explicitly handle the case where the destination fixed point format contains more integer bits than the floating point format mantissa. When converting these floating point values to a format such as DXGI_FORMAT_R32G32B32A32_UINT or _SINT, the closest value is chosen. When the original floating point absolute value is larger than 2^24, the least significant bits of the destination are to be filled with 0's for _UINT and positive _SINT values, or 1's for negative _SINT values.
The full extent of the resource view is always cleared. Viewport and scissor are not applied.
Depth clear values outside of the range specified in viewport range(15.6.1) will not be passed to the DDI.
// part of user mode Device interface: STDMETHOD_( void, ClearRenderTarget )( D3D10DDI_HDEVICE hDevice, D3D11DDI_HRENDERTARGETVIEW hRenderTargetView, FLOAT ColorRGBA[ 4 ] ); STDMETHOD_( void, ClearDepthStencil )( D3D10DDI_HDEVICE hDevice, D3D11DDI_HDEPTHSTENCILVIEW hDepthStencilView, UINT DSFlags, FLOAT Depth, UINT8 Stencil );
For UnorderedAccessViews(5.3.9), there are a couple of ways to Clear the View.
ClearUnorderedAccessViewUint(...) clears a UAV with bit-precise values, copying the lower ni bits from each array element i to the corresponding channel, where ni is the number of bits in the ith channel of the resource Format (for example, an R8G8B8A8 format takes the lower 8 bits of each of the four array elements). This works on any UAV with no format conversion. For Raw Buffer and Structured Buffer Views, only the first array element's value is used.
ClearUnorderedAccessViewFloat(...) clears a UAV with a float value. It only works on FLOAT, UNORM, and SNORM UAVs, with format conversion from FLOAT to *NORM where appropriate. On other UAVs, the operation is invalid and the call will not reach the driver.
// part of user mode Device interface:
STDMETHOD_( void, ClearUnorderedAccessViewUint )(
    D3D10DDI_HDEVICE hDevice,
    D3D11DDI_HUNORDEREDACCESSVIEW hUnorderedAccessView,
    UINT Values[ 4 ] );
STDMETHOD_( void, ClearUnorderedAccessViewFloat )(
    D3D10DDI_HDEVICE hDevice,
    D3D11DDI_HUNORDEREDACCESSVIEW hUnorderedAccessView,
    FLOAT Values[ 4 ] );
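A sketch of the corresponding API-level usage, assuming an existing UAV of a suitable format:

#include <d3d11.h>

void ClearUavExamples(ID3D11DeviceContext* ctx, ID3D11UnorderedAccessView* uav)
{
    // Bit-precise clear: works on any UAV. For an R8G8B8A8 format, the lower
    // 8 bits of each array element land in the corresponding channel.
    const UINT uintVals[4] = { 0xFF, 0x00, 0x7F, 0xFF };
    ctx->ClearUnorderedAccessViewUint(uav, uintVals);

    // Float clear: valid only for FLOAT, UNORM and SNORM UAV formats.
    const FLOAT floatVals[4] = { 1.0f, 0.0f, 0.5f, 1.0f };
    ctx->ClearUnorderedAccessViewFloat(uav, floatVals);
}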
ClearView is a view clearing command, implemented in whatever way the driver sees as most efficient. The primary distinction versus the other Clears described above from D3D11 is that this takes a list of rects (an empty list clears the entire surface). This method only works on an RTV, UAV, or any Video View of a Texture2D surface (the runtime drops invalid calls). All array slices in the view get the same clear applied (any rects apply to each array slice).
The driver or hardware is responsible for clamping rects to the surface extents.
Color values are converted/clamped to the destination format as appropriate per D3D conversion rules. E.g. if the format of the view is R8G8B8A8_UNORM, inputs are clamped to 0.0f to 1.0f (NaN to 0).
If the format is integer, such as R8G8B8A8_UINT, inputs are taken as integral floats, so 235.0f maps to 235 (fractions rounded to zero, out of range/INF values clamped to target range, NaN to 0).
typedef VOID ( APIENTRY* PFND3D11_1DDI_CLEARVIEW )(
    D3D10DDI_HDEVICE hDevice,
    D3D11DDI_HANDLETYPE viewType, // View type that supports this clear (RTV, UAV or any Video view).
                                  // Must be a Texture2D{Array} resource only.
    VOID* hView,
    const FLOAT color[ 4 ],       // interpretation of color is view / format specific
    const D3D10_DDI_RECT* pRect,  // Rects are subject to alignment constraints based on the format being cleared,
                                  // e.g. subsampled video formats require rect extents snapped to full sample boundaries.
                                  // NULL means clear the entire view.
    UINT numRects );
Color Mappings for RTVs and UAVs:
Color[0]: R
Color[1]: G
Color[2]: B
Color[3]: A
(e.g. An RTV of the Y plane of an NV12 surface, of format R8_*, would take the color from R. An RTV of the UV plane of an NV12 surface, of format R8G8_*, would take the color from RG.)
Color Mappings for Video Views:
Color[0]: Y
Color[1]: U/Cb
Color[2]: V/Cr
Color[3]: A
For Video Views with YUV or YCbCr formats, no color space conversion happens - and in cases where the format name doesn't indicate _UNORM vs. _UINT etc., _UINT is assumed (so input 235.0f maps to 235 as described above).
This feature is required to be supported by all D3D10+ hardware with D3D11.1 drivers, and for D3D9 drivers it maps to already existing functionality. The D3D9 equivalent honored the scissor rect, so emulation of ClearView on the D3D9 DDI will unset the scissor, clear, then reset the scissor to achieve the intended behavior of ClearView (this scissor manipulation isn't needed on the new D3D11.1 ClearView DDI, which ignores scissor/viewports by definition).
Having this Clear with rects provides parity with D3D9 where there was a similar Clear that in particular was used for video. With Video added to D3D11 (outside the scope of this spec), adding this ClearView provides parity with D3D9.
Direct2D will be another user of this for rendering scenarios that map to a fill.
For RTVs and UAVs: The space the ClearView rects apply to is that of the view format (as opposed to the surface format, which for video surfaces can have different dimensions). This is consistent with how Viewports and rendering work on those views. E.g. for a 64x64 YUYV surface, an RTV with the format R8G8B8A8_UINT appears in shaders (and to RSSetViewports()) as having dimensions 32x64 RGBA values. ClearView's rects apply to the same space. The "color" coming into ClearView just maps to the channels in the view (RGBA), ignoring the video layout. So a single clear color could really mean "stripes" of color if interpreted in video space. That's not interesting to do, but it just falls out and isn't worth bothering to validate out - the user who makes D3D views of video surfaces has to know they are operating on the raw memory via D3D - be it shaders or APIs like ClearView.
By contrast, ClearView on Video Views (the views that are used with the video pipeline and not D3D Rasterization) operate on logical surface dimensions. So a 64x64 YUYV surface appears as though it is that size, and so rects passed into ClearView are in that full 64x64 space (not 32x64). It is undefined to request clearing non-aligned rects (covering only half of the pixel pairs). The color passed into ClearView is just a single YUV value that is appropriately replicated for subsampled pixels by the driver. Video Views hide the memory layout from the API user, so they do not have to worry about what type of subsampling is going on (an exception is the alignment of the rect bounds).
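For the RTV case, a sketch of ClearView usage at the API level (exposed as ID3D11DeviceContext1::ClearView in the public headers):

#include <d3d11_1.h>

// Clear two rects of a render target view to opaque red. Passing
// nullptr/0 for the rect list would clear the entire view instead.
void ClearTwoRects(ID3D11DeviceContext1* ctx, ID3D11RenderTargetView* rtv)
{
    const FLOAT red[4] = { 1.0f, 0.0f, 0.0f, 1.0f }; // Color[0..3] = RGBA for RTVs
    const D3D11_RECT rects[2] = { { 0, 0, 16, 16 }, { 32, 32, 64, 64 } };
    ctx->ClearView(rtv, red, rects, 2);
}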
Section Contents
(back to chapter)
5.3.1 Overview
5.3.2 Performant Readback
5.3.3 Conversion Resource Copies/ Blts
5.3.4 Buffer
All Resources must be qualified with a set of Pipeline Bind flags at creation time to indicate where in the graphics pipeline the Resource may be bound. Binding a Resource at a certain pipeline location imposes certain restrictions on the Resource for its entire lifetime. Naturally, Resources may be bound at more than one location in the pipeline (even simultaneously, within certain restrictions), but the Resource must satisfy all the restrictions that each Pipeline Bind flag imposes. Certain pipeline locations only accept Resource Views(5.2) to be bound to them. In such a case, the presence of the Pipeline Bind flag indicates that Resource Views can be created against the Resource in order to bind the Resource to such a pipeline location. Sometimes Pipeline Bind flags impose restrictions which conflict with each other, so such Pipeline Bind flags are naturally mutually exclusive. Otherwise, explicit mention is given when one Pipeline Bind flag prevents the usage of other Pipeline Bind flags.
The following table indicates which Resource Types may be bound to which available graphics Pipeline locations. A single entire Resource may not be able to have itself bound entirely to both an input and output Pipeline stage during a Draw operation. However, it is possible to refer to discrete components of the Resource, with Resource Views(5.2), allowing the same Resource to be bound as an input and output simultaneously, as long as the different Views do not share the same Subresources. For example: A two-dimensional mipped Resource created with the appropriate Pipeline Bind flags may have Subresources bound as Shader Resource Inputs, and a mutually exclusive Subresource from the same Resource bound as a RenderTarget Output, by using different Views.
Resource Type | Input Assembler Vertex or Index | Shader Resource Input | Shader Constant Input | Stream Output | RenderTarget Output | Depth/ Stencil Output |
---|---|---|---|---|---|---|
Buffer | U | V | U | U | V | |
Texture1D | | V | | | V | V |
Texture2D | | V | | | V | V |
Texture3D | | V | | | V | |
TextureCube | | V | | | V | V |
(U = Buffer bound directly; V = bound via a Resource View.)
Any Resource that is used as an output for the graphics pipeline cannot be mapped/ locked. This is not meant to block an application from viewing the contents of such a Resource. It is expected that to read the contents of such Resources in a performant manner, the contents must be copied to a Resource which is able to be mapped/ locked for CPU read access. Typically, the Resource which is able to be mapped/ locked will not be marked with any Pipeline Bind flags, and as such is expected to be a driver allocated system memory Resource which is allocated in such a fashion to be compatible with the hardware DMA engine. The Resource is also expected to be allocated for performant CPU reads. This enables an asynchronous performant read back for the CPU.
The Performant Readback(5.3.2) scenario highlights the need that, for any device-dependent memory arrangement used to optimize GPU Resources which cannot be mapped/ locked, there is always a performant ability to convert the memory arrangement into the device-independent memory arrangement that will be used to satisfy the map/ lock. This principle also relates to input Resources that cannot be mapped/ locked: non-mappable/ non-lockable input Resources may use a device-dependent memory arrangement and still be updated with UpdateSubresourceUP(5.6.8), CopyResource(5.6.3), and CopySubresourceRegion(5.6.2). Therefore, there is also a need for a performant ability to convert the device-independent memory arrangement into any device-dependent memory arrangement.
The Buffer is the only Resource which can be created as Unstructured(5.1.2). When the Buffer is bound to the graphics Pipeline, its memory interpretation generally must also be bound to the graphics Pipeline along with it (providing types and offsets for the Element(s) in the Resource, as well as an overall stride). Sometimes this information is bound or described separately.
A Buffer has neither multiple mip levels nor multiple array slices, so a Buffer is made up of only a single Subresource. Buffers can be bound at multiple places in the pipeline simultaneously during a Draw call as long as the Buffer is only read from at each location. If the Buffer is being written to, then the Buffer may only be bound to one location in the pipeline during a Draw call.
When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as an Input Assembler Vertex Input, the Buffer may contain multiple types of data per vertex. This data type, offset, and stride binding is done when the Resource is bound to the Pipeline.
When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as an Input Assembler Index Input, and the Buffer is bound as an Index Input, at the time of binding, the format must be specified as one of: R16_UINT, or R32_UINT.
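A sketch of these late-bound Input Assembler bindings at the API level, assuming already-created vertex and index Buffers (the 32-byte vertex stride is hypothetical):

#include <d3d11.h>

void BindIABuffers(ID3D11DeviceContext* ctx, ID3D11Buffer* vb, ID3D11Buffer* ib)
{
    // Vertex stride and offset are late-bound here, not at Buffer creation.
    const UINT stride = 32, offset = 0;
    ctx->IASetVertexBuffers(0, 1, &vb, &stride, &offset);
    // The Index Buffer format is specified at bind time: R16_UINT or R32_UINT.
    ctx->IASetIndexBuffer(ib, DXGI_FORMAT_R16_UINT, 0);
}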
When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as a Shader Constant Input, the format of the Buffer is assumed to be R32G32B32A32_TYPELESS when bound as a Shader Constant Input. The Buffer size viewable from a shader is restricted to hold a maximum of 4096 elements. The overall buffer size can be larger - see Offsetting Constant Buffer Bindings(5.3.4.3.2). The usage of Constant Buffers within shaders is expected to make Shader execution more efficient than using ld(22.4.6) or sample(22.4.15) with a Shader Resource within the Shader. Constant Input is read into a Shader given an integer array index to fetch a single Element. This is similar to point sampling of a texture, as there is no filtering. Constant Input is only needed to store Shader constants which could change between Draw() calls, as opposed to Immediate Constants or an Immediate Constant Buffer, which are embedded into a Shader.
A Shader Constant Resource is expected to be optimized for moving constant data from the CPU to the graphics adapter, and as such, may not be able to be mapped/ locked, allowing the CPU to read the contents of the Buffer directly. Therefore, the Resource may only be CPUWRITE (write-only) or not mappable/ lockable. In addition, if the Resource is mappable/ lockable, Map/ Lock must be called with DISCARDRESOURCE. NOOVERWRITE is not valid on Shader Constant Resources either. The Resource may still be used with CopyResource(5.6.3) and CopySubresourceRegion(5.6.2). All other Pipeline Bind flags are prevented from being used, disallowing constant buffers to be vertex buffers, streamed out to or rendered to, etc.
Map() allows NO_OVERWRITE for Constant Buffers. This was disallowed before D3D11.1.
Similarly, UpdateSubresource1() adds the ability to perform partial Constant Buffer updates, so the pDstBox parameter does not have to be NULL when updating Constant Buffers via UpdateSubresource1(). Either the NO_OVERWRITE or DISCARD flag must be specified for a partial update, and the extents of the pDstBox parameter must be aligned to 16 byte (full constant) boundaries or the call is dropped.
Before the first call with NO_OVERWRITE on a deferred context, a DISCARD must be done on the same context (via Copy*()/Update*()/Map() API flag or Discard*() API). This is not required on immediate contexts if the application knows the GPU is finished with the resource (though discard can be used if not).
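A sketch of a partial Constant Buffer update honoring these rules (the constant range chosen is arbitrary):

#include <d3d11_1.h>

// Update constants 4..7 (bytes 64..127) of a Constant Buffer. The box
// extents are multiples of 16 bytes, as required, and NO_OVERWRITE promises
// the region is not referenced by in-flight GPU work.
void UpdateConstants(ID3D11DeviceContext1* ctx, ID3D11Buffer* cb,
                     const float newData[16] /* four 16-byte constants */)
{
    D3D11_BOX box = {};
    box.left  = 4 * 16;  // first byte of constant 4
    box.right = 8 * 16;  // one past the last byte of constant 7
    box.top = 0; box.bottom = 1;
    box.front = 0; box.back = 1;
    ctx->UpdateSubresource1(cb, 0, &box, newData, 0, 0, D3D11_COPY_NO_OVERWRITE);
}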
This feature is required to be supported for all D3D10+ hardware with D3D11.1 drivers.
This allows applications to partially go back to a DX9 style convention where they have the ability to set invidivual constants in a Constant Buffer if they like (albeit with the new simplifying NO_OVERWRITE limitation - the updates can't conflict with existing constant references that may be in flight on the GPU). The restriction to not allow partial Constant Buffer updates when Constant Buffers were added to D3D10 was intended to simplify the system handling of shader constants on the assumption that applications could simply organize their constant data in to groups, each with its own Constant Buffer, organized by frequency of update. The impression seems to be that in many cases this restriction was a net performance loss for applications, hence this proposed change to at least partially loosen up Constant Buffer updates.
A common desire for high performance game engines is to collect a large batch of Constant Buffer updates for constants to be referenced by separate Draw*() calls, each needing their own constants, all at once. This is facilitated by allowing the application to create a large Buffer and then pointing individual shaders to regions within it (kind of like a View, but without having to make a whole object to describe the view).
Constant Buffers are allowed to be created larger than the maximum Constant Buffer size that an individual shader can reference, which is at most 4096 16-byte elements (64 KB). Each "element" is one 4-component Shader Constant.
The Constant Buffer Resource size is limited only by the size of memory allocation the system is capable of handling (limits defined elsewhere, and more than large enough for the purpose of the discussion here).
When a Constant Buffer larger than 4096 elements in size is bound to the pipeline via *SetShaderConstants() APIs [e.g. VSSetShaderConstants()], it appears to the shader as if it is only 4096 elements in size.
Variants of the *SetShaderConstants() APIs, *SetShaderConstants1() allow a "FirstConstant" and "NumConstants" to be specified along with the binding. When the shader accesses a Constant Buffer bound this way it will appear as if it starts at the specified "FirstConstant" offset (where 1 means 16 bytes) and has a size defined by NumConstants (number of 16 byte Constants). This is basically a lightweight "View" of a region of a larger Constant Buffer.
FirstConstant must be a multiple of 16 constants.
NumConstants must be a multiple of 16 constants, in the range [0..4096].
If any part of the range defined by FirstConstant and NumConstants falls off the underlying resource, accesses to those addresses count as out of bounds reads from the shader, which is defined to return 0 for all components.
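A sketch of binding such a window at the API level (the per-draw region of 16 constants is just an example):

#include <d3d11_1.h>

// Point VS constant buffer slot 0 at a 256-byte region of a large Constant
// Buffer. Offsets and sizes are in units of 16-byte constants and must be
// multiples of 16.
void BindConstantWindow(ID3D11DeviceContext1* ctx, ID3D11Buffer* bigCB,
                        UINT drawIndex)
{
    UINT firstConstant = drawIndex * 16; // one 16-constant region per draw
    UINT numConstants  = 16;
    ctx->VSSetConstantBuffers1(0, 1, &bigCB, &firstConstant, &numConstants);
}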
This feature is required to be supported for all D3D10+ hardware in D3D11.1 drivers and is emulated by the runtime on Feature Level 9_x running on D3D9 drivers.
When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input and it is a typed Buffer (the view specifies a format type), it may be read from within shaders with the ld(22.4.6) instruction. See the description of this instruction for detail. To use a typed Buffer as a Shader Resource Input, it must be bound at one of the available 128 slots for input Resources, by first creating the appropriate View for this particular stage of the graphics pipeline. It is fine for the same Buffer to be bound to multiple slots simultaneously, possibly even with different Element formats or initial offsets. However at each binding, only a single Element type is permitted, and the data stride is implied to be equal to the Element size. In other words, "Array-of-structure" style layouts cannot be described for typed Buffers bound at Shader Resource Input. Structured Buffers allow array-of-structures access, though without any automatic format conversion for elements.
Just like Typed Buffers, Raw and Structured Buffers can be bound to the pipeline via Shader Resource Views for reading into shaders via ld_raw(22.4.10) and ld_structured(22.4.12) instructions, respectively.
Details of the usage of such a Resource are described in the Stream Output section(14). There are two types of bindings available for Stream Output Buffers: one treats a single output Buffer as a Multiple-Element Buffer (array-of-structures), while the other permits multiple output Buffers each treated as Single-Element Buffers (structure-of-arrays). Single-Element Buffer output is expected to be used typically for recirculation (subsequently) as a Shader Resource Input, but this can also be used as Input Assembler Vertex Input. Multiple-Element Buffer output is only intended to be used for recirculating data (subsequently) back as Input Assembler Vertex Input (since Multiple-Element Buffer access is not currently available in Shaders).
If the Resource has the Input Assembler Vertex Input Pipeline Bind flag specified, the Resource may also be used with DrawAuto(8.9).
Since this is an output stage, Resources with this Pipeline Bind flag can never be mapped/ locked for CPU access. This does not entirely prevent the CPU from viewing the contents of such Resources, as there are performant(5.3.2) methods for doing so.
When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as a RenderTarget Output, this Pipeline Bind flag indicates that Render Target Views may be created with this Resource.
Constraints when a Buffer is used as RenderTarget output: it cannot be paired with any Depth/Stencil Output (i.e. no depth buffering); it can only have a single Element defined, with a data stride implied to be equal to the Element width; the View is limited to a maximum width of 16384 (multiple Views with different offsets would be needed to leverage the entire Buffer). In all other regards, a Buffer render target output is identical to the Texture1D case.
Since this is an output stage, Resources with this Pipeline Bind flag can never be mapped/ locked for CPU access. This does not entirely prevent the CPU from viewing the contents of such Resources, as there are performant(5.3.2) methods for doing so.
When the Unordered Access Pipeline Bind has been indicated, Unordered Access Views may be created for use at the Compute Shader or Pixel Shader.
A Texture1D is a homogeneous array of 1D Textures. The array is homogeneous in the sense that each Texture has the same data format and dimensions (including miplevels). The entire array of Textures is created atomically. The memory for the entire Resource need not be contiguous. A Texture1D may not be created as Unstructured(5.1.2), but may be created as Prestructured+Typeless Memory(5.1.5) or as Prestructured+Typed Memory(5.1.6). As illustrated by the diagram(5) and binding configurations(5.3.1), a Texture1D may be decomposed into sub-groups of Mip Slices, Array Slices, and Subresources in order to refer to discrete components of the Resource to accomplish certain operations. The decomposition for graphics Pipeline Binding is achieved through the usage of Views for each stage of the pipeline.
Like other Resources, a Texture1D must be qualified with a set of flags at creation indicating where in the graphics pipeline the Resource may be bound. Naturally, the Resource may be bound at more than one location in the pipeline, but the Resource must have been created with the restrictions that each Pipeline Bind flag indicates. Sometimes Pipeline Bind flags have restrictions which conflict with each other, so such Pipeline Bind flags are mutually exclusive.
When the Texture1D has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input, the Texture1D Resource may be read from within shaders with the ld(22.4.6) or sample(22.4.15) instructions, after they are bound to the pipeline through the usage of Views. See the descriptions of these instructions for details. Each Element from a Texture1D to be read into a Shader counts towards a limit on the total number of elements addressable from Resources (128). Texture1D Resources are addressed from the Shader with a 1D coordinate plus a 2nd coordinate specifying which Array Slice in the Texture1D to fetch from. The 2nd coordinate, if provided as floating point data, is rounded (nearest even), producing an integral array index. Typical 1D filtering occurs on the Array Slice chosen by the 2nd coordinate.
When a Texture1D Mip Slice is bound as a RenderTarget Output, through the usage of Views, it is allowable to use an accompanying Texture1D Depth/ Stencil of the same dimensions. For example, if the most detailed Mip Slice View of a Texture1D (width=6, arraysize=8) is bound as a RenderTarget Output, an effective Texture1D View of (width=6, arraysize=8) may be used as a Depth/ Stencil. Also, the particular Array Slice in the Texture1D to render to is chosen, from the Geometry Shader stage, by declaring a scalar component of output data as the System Interpreted Value "renderTargetArrayIndex". If such a value is not present in primitive data reaching the rasterizer, the default is to render to Array Slice 0.
Rasterization to Texture1D resources is identical to rasterizing to a Texture2D resource with a y dimension of 1, thus both x and y coordinates are honored and only rendering that covers the Nx1 area of these resources will update them.
Since this is an output stage, Resources with this Pipeline Bind flag can never be mapped/ locked for CPU access. This does not entirely prevent the CPU from viewing the contents of such Resources, as there are performant(5.3.2) methods for doing so.
When the Texture1D has been created with the Pipeline Bind flag indicating that it may be used as a Depth/ Stencil Output, the Texture1D Resource may only be one of a few Resource Formats (essentially only those which have a 'D' component or those TYPELESS formats which can be converted to a format with a 'D' component), such as D32_FLOAT or R32_TYPELESS, etc.
Resources created with this Pipeline Bind flag cannot also be used as a RenderTarget (the two flags are mutually exclusive).
Since this is an output stage, Resources with this Pipeline Bind flag can never be mapped/ locked for CPU access. In addition, Depth/ Stencil Resources cannot be a destination for CopyResource(5.6.3), CopySubresourceRegion(5.6.2), or UpdateSubresourceUP(5.6.8) operations. This does not entirely prevent the CPU from viewing the contents of such Resources, as there are performant(5.3.2) methods for doing so.
A Texture2D is a homogeneous array of 2D Textures. The array is homogeneous in the sense that each Texture has the same data format and dimensions (including miplevels). The entire array of Textures is created atomically. The memory for the entire Resource need not be contiguous. A Texture2D may not be created as Unstructured(5.1.2), but may be created as Prestructured+Typeless Memory(5.1.5) or as Prestructured+Typed Memory(5.1.6). As illustrated by the diagram(5) and binding configurations(5.3.1), a Texture2D may be decomposed into sub-groups of Mip Slices, Array Slices, and Subresources in order to refer to discrete components of the Resource to accomplish certain operations. The decomposition for graphics Pipeline Binding is achieved through the usage of Views for each stage of the pipeline.
Like other Resources, a Texture2D must be qualified with a set of flags at creation indicating where in the graphics Pipeline the Resource may be bound. Naturally, the Resource may be bound at more than one location in the Pipeline, but the Resource must have been created with the restrictions that each Pipeline Bind flag indicates. Sometimes Pipeline Bind flags have restrictions which conflict with each other, so such Pipeline Bind flags are mutually exclusive.
When the Texture2D has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input, the Texture2D Resource may be read from within shaders with the ld(22.4.6) or sample(22.4.15) instructions, after they are bound to the pipeline through the usage of Views. See the descriptions of these instructions for details. Each Element from a Texture2D to be read into a Shader counts towards a limit on the total number of elements addressable from Resources (128). Texture2D Resources are addressed from the Shader with a 2D coordinate plus a 3rd coordinate specifying which Array Slice in the Texture2D to fetch from. The 3rd coordinate, if provided as floating point data, is rounded (nearest even), producing an integral array index. Typical 2D filtering occurs on the Array Slice chosen by the 3rd coordinate.
When a Texture2D Mip Slice View is bound as a RenderTarget Output, through the usage of Views, it is allowable to use an accompanying effective Texture2D Depth/ Stencil View of the same dimensions. For example, if the most detailed Mip Slice View of a Texture2D (width=6, height=4, arraysize=8) is bound as a RenderTarget Output, an effective Texture2D View of (width=6, height=4, arraysize=8) may be used as a Depth/ Stencil. Also, the particular Array Slice in the Texture2D to render to is chosen, from the Geometry Shader stage, by declaring a scalar component of output data as the System Interpreted Value "renderTargetArrayIndex". If such a value is not present in primitive data reaching the rasterizer, the default is to render to Array Slice 0.
Since this is an output stage, Resources with this Pipeline Bind flag can never be mapped/ locked for CPU access. This does not entirely prevent the CPU from viewing the contents of such Resources, as there are performant(5.3.2) methods for doing so.
When the Texture2D has been created with the Pipeline Bind flag indicating that it may be used as a Depth/ Stencil Output, the Texture2D Resource may only be one of a few Resource Formats (essentially only those which have a 'D' component or those TYPELESS formats which can be converted to a format with a 'D' component), such as D32_FLOAT or R32_TYPELESS, etc.
Resources created with this Pipeline Bind flag cannot also be used as a RenderTarget (the two flags are mutually exclusive).
Since this is an output stage, Resources with this Pipeline Bind flag can never be mapped/ locked for CPU access. In addition, Depth/ Stencil Resources cannot be a destination for CopyResource(5.6.3), CopySubresourceRegion(5.6.2), or UpdateSubresourceUP(5.6.8) operations. This does not entirely prevent the CPU from viewing the contents of such Resources, as there are performant(5.3.2) methods for doing so.
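To illustrate the Depth/ Stencil Format rule above, here is a hedged sketch at the public API: the Resource is created R32_TYPELESS, the Depth/ Stencil View resolves it to a 'D' format, and a Shader Resource View over the same memory remains possible for later reading (sizes and names are illustrative):

ID3D11Texture2D* pDepthTex = nullptr;
D3D11_TEXTURE2D_DESC td = {};
td.Width = 1024; td.Height = 768;
td.MipLevels = 1; td.ArraySize = 1;
td.Format = DXGI_FORMAT_R32_TYPELESS;   // TYPELESS at the Resource
td.SampleDesc.Count = 1;
td.Usage = D3D11_USAGE_DEFAULT;
td.BindFlags = D3D11_BIND_DEPTH_STENCIL | D3D11_BIND_SHADER_RESOURCE;
pDevice->CreateTexture2D( &td, nullptr, &pDepthTex );

ID3D11DepthStencilView* pDSV = nullptr;
D3D11_DEPTH_STENCIL_VIEW_DESC dsvDesc = {};
dsvDesc.Format = DXGI_FORMAT_D32_FLOAT; // 'D' format at the View
dsvDesc.ViewDimension = D3D11_DSV_DIMENSION_TEXTURE2D;
pDevice->CreateDepthStencilView( pDepthTex, &dsvDesc, &pDSV );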
A Texture3D is a 3D grid data layout supporting mipmaps, also known as a Volume Texture. The entire Resource is created atomically. The memory for the entire Resource need not be contiguous. A Texture3D may not be created as Unstructured(5.1.2), but may be created as Prestructured+Typeless Memory(5.1.5) or as Prestructured+Typed Memory(5.1.6). As illustrated by the diagram(5) and binding configurations(5.3.1), a Texture3D may be decomposed into sub-groups of Mip Slices, Array Slices, and Subresources in order to refer to discrete components of the Resource to accomplish certain operations. The decomposition for graphics Pipeline Binding is achieved through the usage of Views for each stage of the pipeline.
When the Texture3D has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input, the Texture3D Resource may be read from within shaders with the ld(22.4.6) or sample(22.4.15) instructions, after they are bound to the pipeline through the usage of Views. See the descriptions of these instructions for details. Each Element from a Texture3D to be read into a Shader counts towards a limit on the total number of elements addressable from Resources (128). Texture3D Resources are addressed from the Shader with a 3D coordinate. Typical 3D filtering occurs with this coordinate.
When a Texture3D Mip Slice is bound as a RenderTarget Output, through the usage of Views, the Texture3D behaves identically to a Texture2D with n Array Slices, where n is the depth (3rd dimension) of the Texture3D. The particular z slice in the Texture3D to render to is chosen, from the Geometry Shader stage, by declaring a scalar component of output data as the System Interpreted Value "renderTargetArrayIndex". If such a value is not present in primitive data reaching the rasterizer, the default is to render to z=0.
Since this is an output stage, Resources with this Pipeline Bind flag can never be mapped/ locked for CPU access. This does not entirely prevent the CPU from viewing the contents of such Resources, as there are performant(5.3.2) methods for doing so.
A TextureCube has 6 faces, each of which is like a square Texture2D, including mipmaps. The entire Resource is created atomically. The memory for the entire Resource need not be contiguous. A TextureCube may not be created as Unstructured(5.1.2), but may be created as Prestructured+Typeless Memory(5.1.5) or as Prestructured+Typed Memory(5.1.6). As illustrated by the diagram(5) and binding configurations(5.3.1), a TextureCube may be decomposed into sub-groups of Mip Slices, Array Slices (each representing a face), and Subresources in order to refer to discrete components of the Resource to accomplish certain operations. The decomposition for graphics Pipeline Binding is achieved through the usage of Views for each stage of the pipeline.
TextureCubes can also represent an array of cubes, which means a multiple of 6 faces. Used as a Cube Array, the "array" dimension selects which Cube to use. However, the same resource can also be viewed as a 2D Array, in which case each face of each Cube appears as a single location along the "array" dimension.
When the TextureCube has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input, the TextureCube{Array} Resource may be read from within shaders after they are bound to the pipeline through the usage of Views. The View can expose the TextureCube{Array} as an array of TextureCubes starting from any face (from the perspective of a sequence of 2D faces), then spanning a multiple of 6 faces, such that each 6 faces appears as a location on the array axis. Alternatively, the TextureCube can be viewed as a 2D Array spanning any contiguous set of faces in the resource where each face is a slice, hiding the "Cube-ness" of the resource. Each Element from a TextureCube resource to be read into a Shader counts towards a limit on the total number of elements addressable from Resources (128). TextureCube Resources viewed as a Cube are addressed from the Shader with a 3D vector pointing out from the center of the TextureCube, and as a Cube Array, an additional coordinate provides the Array Slice. If the Array Slice is provided as a floating point number, it is rounded to nearest even.
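A hedged sketch of the two View interpretations at the public API (the resource is assumed to be a cube array with at least 3 cubes, i.e. ArraySize >= 18, created with the TEXTURECUBE misc flag; names and values are illustrative):

ID3D11ShaderResourceView* pCubeSRV = nullptr;
D3D11_SHADER_RESOURCE_VIEW_DESC cubeDesc = {};
cubeDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
cubeDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURECUBEARRAY;
cubeDesc.TextureCubeArray.MostDetailedMip = 0;
cubeDesc.TextureCubeArray.MipLevels = 1;
cubeDesc.TextureCubeArray.First2DArrayFace = 6;  // start at the second cube
cubeDesc.TextureCubeArray.NumCubes = 2;          // spans faces 6..17
pDevice->CreateShaderResourceView( pCubeArrayTex, &cubeDesc, &pCubeSRV );

ID3D11ShaderResourceView* pFaceSRV = nullptr;
D3D11_SHADER_RESOURCE_VIEW_DESC arrayDesc = {};
arrayDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
arrayDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2DARRAY;  // hides the "Cube-ness"
arrayDesc.Texture2DArray.MostDetailedMip = 0;
arrayDesc.Texture2DArray.MipLevels = 1;
arrayDesc.Texture2DArray.FirstArraySlice = 3;  // any contiguous set of faces
arrayDesc.Texture2DArray.ArraySize = 4;
pDevice->CreateShaderResourceView( pCubeArrayTex, &arrayDesc, &pFaceSRV );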
When a TextureCube{Array} Mip Slice is bound as a RenderTarget Output, the TextureCube behaves identically to a Texture2DArray, such that any contiguous subset of the faces in the array participate in the View. The particular Array Slice in the View to render to is chosen from the Geometry Shader stage, by declaring a scalar component of output data as the System Interpreted Value "renderTargetArrayIndex". If such a value is not present in primitive data reaching the rasterizer, the default is to render to Array Slice 0.
Since this is an output stage, Resources with this Pipeline Bind flag can never be mapped/ locked for CPU access. This does not entirely prevent the CPU from viewing the contents of such Resources, as there are performant(5.3.2) methods for doing so.
When the TextureCube{Array} has been created with the Pipeline Bind flag indicating that it may be used as a Depth/ Stencil Output, the Resource may only be one of a few Resource Formats (essentially only those which have a 'D' component or those TYPELESS formats which can be converted to a format with a 'D' component), such as D32_FLOAT or R32_TYPELESS, etc. In addition, when rendering using such a Depth/ Stencil TextureCube (viewed as a Texture2DArray Depth Stencil View), only equally sized RenderTarget Views are compatible for use as a RenderTarget Output.
Resources created with this Pipeline Bind flag cannot also be used as a RenderTarget (the two flags are mutually exclusive).
Since this is an output stage, Resources with this Pipeline Bind flag can never be mapped/ locked for CPU access. In addition, Depth/ Stencil Resources cannot be a destination for CopyResource(5.6.3), CopySubresourceRegion(5.6.2), or UpdateSubresourceUP(5.6.8) operations. This does not entirely prevent the CPU from viewing the contents of such Resources, as there are performant(5.3.2) methods for doing so.
typedef struct D3D10DDI_HSHADERRESOURCEVIEW
{
    void* m_pDrvPrivate;
} D3D10DDI_HSHADERRESOURCEVIEW;

typedef struct D3D10DDIARG_BUFFER_SHADERRESOURCEVIEW
{
    union
    {
        UINT FirstElement;  // Nicer name
                            // < ResourceWidth / ElementSize
        UINT ElementOffset;
    };
    union
    {
        UINT NumElements;   // Nicer name
                            // <= ( ResourceWidth / ElementSize - ElementOffset )
        UINT ElementWidth;
    };
} D3D10DDIARG_BUFFER_SHADERRESOURCEVIEW;

typedef struct D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW
{
    union
    {
        UINT FirstElement;  // Nicer name
                            // < ResourceWidth / ElementSize
        UINT ElementOffset;
    };
    union
    {
        UINT NumElements;   // Nicer name
                            // <= ( ResourceWidth / ElementSize - ElementOffset )
        UINT ElementWidth;
    };
    UINT Flags;             // See D3D11_DDI_BUFFEREX_SRV_FLAG_* below
} D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW;

#define D3D11_DDI_BUFFEREX_SRV_FLAG_RAW 0x00000001

typedef struct D3D10DDIARG_TEX1D_SHADERRESOURCEVIEW
{
    UINT MostDetailedMip;  // < Resource MipLevels
    UINT FirstArraySlice;  // < Resource ArraySize
    UINT MipLevels;        // <= ( Resource MipLevels - MostDetailedMip )
    UINT ArraySize;        // <= ( Resource ArraySize - FirstArraySlice )
} D3D10DDIARG_TEX1D_SHADERRESOURCEVIEW;

typedef struct D3D10DDIARG_TEX2D_SHADERRESOURCEVIEW
{
    UINT MostDetailedMip;  // < Resource MipLevels
    UINT FirstArraySlice;  // < Resource ArraySize
    UINT MipLevels;        // <= ( Resource MipLevels - MostDetailedMip )
    UINT ArraySize;        // <= ( Resource ArraySize - FirstArraySlice )
} D3D10DDIARG_TEX2D_SHADERRESOURCEVIEW;

typedef struct D3D10DDIARG_TEX3D_SHADERRESOURCEVIEW
{
    UINT MostDetailedMip;  // < Resource MipLevels
    UINT MipLevels;        // <= ( Resource MipLevels - MostDetailedMip )
} D3D10DDIARG_TEX3D_SHADERRESOURCEVIEW;

typedef struct D3D10DDIARG_TEXCUBE_SHADERRESOURCEVIEW
{
    UINT MostDetailedMip;
    UINT MipLevels;
} D3D10DDIARG_TEXCUBE_SHADERRESOURCEVIEW;

typedef struct D3D10_1DDIARG_TEXCUBE_SHADERRESOURCEVIEW
{
    UINT MostDetailedMip;   // < Resource MipLevels
    UINT MipLevels;         // <= ( Resource MipLevels - MostDetailedMip )
    UINT First2DArrayFace;  // <= ( Resource ArraySize - 5 )
    UINT NumCubes;          // multiple of 6 faces that must fit in resource after First2DArrayFace
} D3D10_1DDIARG_TEXCUBE_SHADERRESOURCEVIEW;

typedef struct D3D11DDIARG_CREATESHADERRESOURCEVIEW
{
    D3D10DDI_HRESOURCE hDrvResource;
    DXGI_FORMAT Format;  // Fully qualified
    D3D10DDIRESOURCE_TYPE ResourceDimension;
    union
    {
        D3D10DDIARG_BUFFER_SHADERRESOURCEVIEW Buffer;
        D3D10DDIARG_TEX1D_SHADERRESOURCEVIEW Tex1D;
        D3D10DDIARG_TEX2D_SHADERRESOURCEVIEW Tex2D;
        D3D10DDIARG_TEX3D_SHADERRESOURCEVIEW Tex3D;
        D3D10_1DDIARG_TEXCUBE_SHADERRESOURCEVIEW TexCube;
        D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW BufferEx;
    };
} D3D11DDIARG_CREATESHADERRESOURCEVIEW;

// part of user mode Device interface:
STDMETHOD_( SIZE_T, CalcPrivateShaderResourceViewSize )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_CREATESHADERRESOURCEVIEW* pCreateShaderResourceView );
STDMETHOD( CreateShaderResourceView )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_CREATESHADERRESOURCEVIEW* pCreateShaderResourceView,
    D3D10DDI_HSHADERRESOURCEVIEW hDrvShaderResourceView );
STDMETHOD_( void, DestroyShaderInput )(
    D3D10DDI_HDEVICE hDrvDevice,
    D3D10DDI_HSHADERRESOURCEVIEW hDrvShaderResourceView );

typedef struct D3D10DDI_HRENDERTARGETVIEW
{
    void* m_pDrvPrivate;
} D3D10DDI_HRENDERTARGETVIEW;

typedef struct D3D10DDIARG_BUFFER_RENDERTARGETVIEW
{
    union
    {
        UINT FirstElement;  // Nicer name
                            // < ResourceWidth / ElementSize
        UINT ElementOffset;
    };
    union
    {
        UINT NumElements;   // Nicer name
                            // <= ( ResourceWidth / ElementSize - ElementOffset )
        UINT ElementWidth;
    };
} D3D10DDIARG_BUFFER_RENDERTARGETVIEW;

typedef struct D3D10DDIARG_TEX1D_RENDERTARGETVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;  // < Resource ArraySize
    UINT ArraySize;        // <= ( Resource ArraySize - FirstArraySlice )
} D3D10DDIARG_TEX1D_RENDERTARGETVIEW;

typedef struct D3D10DDIARG_TEX2D_RENDERTARGETVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;  // < Resource ArraySize
    UINT ArraySize;        // <= ( Resource ArraySize - FirstArraySlice )
} D3D10DDIARG_TEX2D_RENDERTARGETVIEW;

typedef struct D3D10DDIARG_TEX3D_RENDERTARGETVIEW
{
    UINT MipSlice;
    UINT FirstW;  // < Resource MipSlice W dimension
    UINT WSize;   // <= ( Resource MipSlice W dimension - FirstW )
} D3D10DDIARG_TEX3D_RENDERTARGETVIEW;

typedef struct D3D10DDIARG_TEXCUBE_RENDERTARGETVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;  // as 2DArray
    UINT ArraySize;        // as 2DArray
} D3D10DDIARG_TEXCUBE_RENDERTARGETVIEW;

typedef struct D3D10DDIARG_CREATERENDERTARGETVIEW
{
    D3D10DDI_HRESOURCE hDrvResource;
    DXGI_FORMAT Format;  // Fully qualified
    D3D10DDIRESOURCE_TYPE ResourceDimension;
    union
    {
        D3D10DDIARG_BUFFER_RENDERTARGETVIEW Buffer;
        D3D10DDIARG_TEX1D_RENDERTARGETVIEW Tex1D;
        D3D10DDIARG_TEX2D_RENDERTARGETVIEW Tex2D;
        D3D10DDIARG_TEX3D_RENDERTARGETVIEW Tex3D;
        D3D10DDIARG_TEXCUBE_RENDERTARGETVIEW TexCube;
    };
} D3D10DDIARG_CREATERENDERTARGETVIEW;

// part of user mode Device interface:
STDMETHOD_( SIZE_T, CalcPrivateRenderTargetViewSize )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D10DDIARG_CREATERENDERTARGETVIEW* pCreateRenderTargetView );
STDMETHOD( CreateRenderTargetView )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D10DDIARG_CREATERENDERTARGETVIEW* pCreateRenderTargetView,
    D3D10DDI_HRENDERTARGETVIEW hDrvRenderTargetView );
STDMETHOD_( void, DestroyRenderTargetView )(
    D3D10DDI_HDEVICE hDrvDevice,
    D3D10DDI_HRENDERTARGETVIEW hDrvRenderTargetView );

typedef struct D3D10DDI_HDEPTHSTENCILVIEW
{
    void* m_pDrvPrivate;
} D3D10DDI_HDEPTHSTENCILVIEW;

typedef struct D3D10DDIARG_TEX1D_DEPTHSTENCILVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;  // < Resource ArraySize
    UINT ArraySize;        // <= ( Resource ArraySize - FirstArraySlice )
} D3D10DDIARG_TEX1D_DEPTHSTENCILVIEW;

typedef struct D3D10DDIARG_TEX2D_DEPTHSTENCILVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;  // < Resource ArraySize
    UINT ArraySize;        // <= ( Resource ArraySize - FirstArraySlice )
} D3D10DDIARG_TEX2D_DEPTHSTENCILVIEW;

typedef struct D3D10DDIARG_TEXCUBE_DEPTHSTENCILVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;  // as 2DArray
    UINT ArraySize;        // as 2DArray
} D3D10DDIARG_TEXCUBE_DEPTHSTENCILVIEW;

typedef enum D3D11_DDI_CREATEDEPTHSTENCILVIEW_FLAG
{
    D3D11_DDI_CREATE_DSV_READ_ONLY_DEPTH   = 0x01L,
    D3D11_DDI_CREATE_DSV_READ_ONLY_STENCIL = 0x02L,
    D3D11_DDI_CREATE_DSV_FLAG_MASK         = 0x03L,
} D3D11_DDI_CREATEDEPTHSTENCILVIEW_FLAG;

typedef struct D3D11DDIARG_CREATEDEPTHSTENCILVIEW
{
    D3D10DDI_HRESOURCE hDrvResource;
    DXGI_FORMAT Format;  // Fully qualified
    D3D10DDIRESOURCE_TYPE ResourceDimension;
    UINT Flags;
    union
    {
        D3D10DDIARG_TEX1D_DEPTHSTENCILVIEW Tex1D;
        D3D10DDIARG_TEX2D_DEPTHSTENCILVIEW Tex2D;
        D3D10DDIARG_TEXCUBE_DEPTHSTENCILVIEW TexCube;
    };
} D3D11DDIARG_CREATEDEPTHSTENCILVIEW;

// part of user mode Device interface:
STDMETHOD_( SIZE_T, CalcPrivateDepthStencilViewSize )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_CREATEDEPTHSTENCILVIEW* pCreateDepthStencilView );
STDMETHOD( CreateDepthStencilView )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_CREATEDEPTHSTENCILVIEW* pCreateDepthStencilView,
    D3D10DDI_HDEPTHSTENCILVIEW hDrvDepthStencilView );
STDMETHOD_( void, DestroyDepthStencilView )(
    D3D10DDI_HDEVICE hDrvDevice,
    D3D10DDI_HDEPTHSTENCILVIEW hDrvDepthStencilView );

typedef struct D3D11DDI_HUNORDEREDACCESSVIEW
{
    void* m_pDrvPrivate;
} D3D11DDI_HUNORDEREDACCESSVIEW;

typedef struct D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW
{
    UINT FirstElement;  // < ResourceWidth / ElementSize
    UINT NumElements;   // <= ( ResourceWidth / ElementSize - FirstElement )
    UINT Flags;         // See D3D11_DDI_BUFFER_UAV_FLAG* below
} D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW;

#define D3D11_DDI_BUFFER_UAV_FLAG_RAW     0x00000001
#define D3D11_DDI_BUFFER_UAV_FLAG_APPEND  0x00000002
#define D3D11_DDI_BUFFER_UAV_FLAG_COUNTER 0x00000004

typedef struct D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;  // < Resource ArraySize
    UINT ArraySize;        // <= ( Resource ArraySize - FirstArraySlice )
} D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW;

typedef struct D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;  // < Resource ArraySize
    UINT ArraySize;        // <= ( Resource ArraySize - FirstArraySlice )
} D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW;

typedef struct D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW
{
    UINT MipSlice;
    UINT FirstW;  // < Resource MipSlice W dimension
    UINT WSize;   // <= ( Resource MipSlice W dimension - FirstW )
} D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW;

typedef struct D3D11DDIARG_CREATEUNORDEREDACCESSVIEW
{
    D3D10DDI_HRESOURCE hDrvResource;
    DXGI_FORMAT Format;  // Fully qualified
    D3D10DDIRESOURCE_TYPE ResourceDimension;  // Runtime will never set this to TexCube
    union
    {
        D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW Buffer;
        D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW Tex1D;
        D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW Tex2D;
        D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW Tex3D;
    };
} D3D11DDIARG_CREATEUNORDEREDACCESSVIEW;

// part of user mode Device interface:
STDMETHOD_( SIZE_T, CalcPrivateUnorderedAccessViewSize )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_CREATEUNORDEREDACCESSVIEW* pCreateUnorderedAccessView );
STDMETHOD( CreateUnorderedAccessView )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_CREATEUNORDEREDACCESSVIEW* pCreateUnorderedAccessView,
    D3D11DDI_HUNORDEREDACCESSVIEW hDrvUnorderedAccessView );
STDMETHOD_( void, DestroyUnorderedAccessView )(
    D3D10DDI_HDEVICE hDrvDevice,
    D3D11DDI_HUNORDEREDACCESSVIEW hDrvUnorderedAccessView );
Unordered Access Views (UAVs) can be bound at the Output Merger(17) (available to all graphics shader stages from there) and Compute Shader(18) stage.
At the Output Merger, there is the constraint that the total number of o# slots (Render Target Views - RTVs) and u# slots (UAVs) that may be bound simultaneously is at most 64, of which no more than 8 can be RTVs. The way this is enforced, for simplicity, is that every o# (RTV) slot that is declared must have a slot # that is less than the minimum # of the u# (UAV) slots that are declared. So it is valid for a Pixel Shader to declare o0, o1, u4 and u63, but it is not valid for a Pixel Shader to declare o0, u3, and o4.
Separating o# from u# this way minimizes future dependence on the fact that they happen to live in the same bind space in D3D11, if that turns out not to be desirable.
The UAVs bound at the Output Merger are visible to all graphics stages (a shared set of UAV bindings). So multiple graphics shader stages can access the same UAVs simultaneously.
Certain shader stages, like the Vertex Shader or Domain Shader (with Tessellation), are implemented by hardware using shader result caches. So if nearby primitives share the same vertex, the results of the corresponding shader invocation for that vertex may be retrieved from a result cache rather than re-executing the shader. The presence of these result caches and their behavior is hardware specific. Previously, without the ability for the unique shader invocations to have side-effects, the user had no way of knowing or depending on any caching taking place, beyond observing some performance wins if the caching worked well. With UAVs available to all shaders (enabling shaders to write arbitrarily to the UAV memory), any hardware-specific shader result caching will be visible, and the burden is left to the application developer to avoid depending on any given hardware's behavior. In particular, the behavior of such caching would not take into account any UAV accesses that take place; the hash key for shader result caching is simply the inputs for a given shader invocation independent of what may be read from UAVs during the shader invocation (which may not occur at all if there is a cache hit).
There is no guarantee that UAV accesses issued from within or across shader stages executing within a given Draw*(), or issued from the Compute Shader within Dispatch*(), finish in the order issued. All UAV accesses are finished at the end of the Draw*()/Dispatch*() though.
The Compute Shader has its own separate set of 64 slots where only UAVs may be bound, independent of the set of RTV+UAV bindpoints for the graphics stages.
In D3D11.0, the number of UAVs was limited to 8 at the Compute Shader and 8 combined RTV+UAV at the Pixel Shader. There have since been requests to increase this limit. In addition, there have been requests to have some sort of logging ability available to all shader stages, at least for debugging purposes. Being able to access UAVs from every graphics Shader Stage permits this.
Dynamic indexing of UAV registers (i.e. dynamically indexing # in u#) is not permitted.
Shader Instructions (defined elsewhere) which are accessing UAVs simply take a u# as a parameter, much like instructions that are sampling from textures take a t# as a parameter.
The D3D11 Resource types that can have a UAV on them are Texture1D{Array}, Texture2D{Array}, Texture3D and Buffer. When the Resource is created at the API/DDI, the bind flag D3D11_{DDI_}BIND_UNORDERED_ACCESS must be specified in order for subsequent creation of UAVs on the resource to be valid.
The D3D11_BIND_UNORDERED_ACCESS flag may be combined with any of the following bind flags:
The D3D11_BIND_UNORDERED_ACCESS flag may NOT be combined with any of the following bind flags:
The constraints on combining D3D11_BIND_UNORDERED_ACCESS with other flags at Resource Creation, such as Usage (dynamic, staging, etc.), are the same as the existing constraints specified for D3D11_BIND_RENDER_TARGET.
The Sample Count on the resource must be 1, and the Sample Quality must be 0.
Note in the DDI, the names above become D3D11_DDI_BIND_*.
typedef struct D3D11DDIARG_CREATEUNORDEREDACCESSVIEW
{
    D3D11DDI_HRESOURCE hDrvResource;
    DXGI_FORMAT Format;
    D3D11DDIRESOURCE_TYPE ResourceDimension;
    union
    {
        D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW Buffer;
        D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW Tex1D;
        D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW Tex2D;
        D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW Tex3D;
    };
} D3D11DDIARG_CREATEUNORDEREDACCESSVIEW;
The Format parameter must be compatible with the format the Resource was created with, and can be any format that supports being bound at the RenderTarget except for SRGB formats. Additional restrictions on the Format for Buffer views are discussed shortly below.
The D3D11DDIARG_*_UNORDEREDACCESSVIEW parameters, describing the view parameters based on resource dimension, are as follows:
typedef struct D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW
{
    UINT FirstElement;
    UINT NumElements;
    UINT Flags;  // see D3D11_DDI_BUFFER_UAV_FLAG* below
} D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW;

#define D3D11_DDI_BUFFER_UAV_FLAG_RAW        0x00000001
#define D3D11_DDI_BUFFER_UAV_FLAG_STRUCTURED 0x00000002
The _RAW flag allows the shader to access the buffer simply as a 1D array of untyped 32-bit data. The Format must be specified as R32_TYPELESS when this flag is used. The underlying Buffer must have been created with D3D11_DDI_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS (D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS at the API).
The _STRUCTURED flag (mutually exclusive with _RAW) requires that the Buffer was created as a Structured Buffer. The Format for a structured buffer must be specified as DXGI_FORMAT_UNKNOWN. The type information for the structured buffer will be inherited from the buffer resource.
The absence of the _RAW and _STRUCTURED flags means the Buffer View is Typed, so the Format of the view can be specified as freely as with any other UAV dimension (1D, 2D, 3D).
When a UAV or SRV is Raw, the FirstElement parameter (defining the start of the view) must result in a 128-bit aligned offset, otherwise the creation of the View will fail. Knowing that the base address of a view is conveniently aligned enables various optimizations/assumptions in hardware for accesses from a shader that are offsets from the base of the view (where the offsets are often literals in the shader).
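As a minimal sketch of this rule (the helper name is hypothetical, not part of any API): raw view elements are 32-bit R32_TYPELESS values, so 128-bit alignment reduces to FirstElement being a multiple of 4.

bool IsValidRawViewFirstElement( UINT firstElement )
{
    const UINT byteOffset = firstElement * 4;  // raw view elements are 32-bit
    return ( byteOffset % 16 ) == 0;           // must land on a 128-bit boundary
}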
typedef struct D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;
    UINT ArraySize;
} D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW;

typedef struct D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW
{
    UINT MipSlice;
    UINT FirstArraySlice;
    UINT ArraySize;
} D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW;

typedef struct D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW
{
    UINT MipSlice;
    UINT FirstW;
    UINT WSize;
} D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW;
The D3D11 OMSetRenderTargets API/DDI accepts RenderTargetViews, a DepthStencilView, and UnorderedAccessViews at the same time. This affects the Graphics side of the pipeline, not the Compute side. Here is the DDI:
typedef VOID ( APIENTRY* PFND3D11DDI_SETRENDERTARGETS )(
    D3D10DDI_HDEVICE,                     // device handle
    CONST D3D11DDI_HRENDERTARGETVIEW*,    // array of RenderTargetViews
    UINT,                                 // index of first RTV to set
    UINT,                                 // number of RTVs being set (all others unbound)
    D3D10DDI_HDEPTHSTENCILVIEW,           // DepthStencilView
    CONST D3D11DDI_HUNORDEREDACCESSVIEW*, // array of UnorderedAccessViews
    UINT*,                                // array of Append buffer offsets (relevant only for
                                          // UAVs which have the Append flag, otherwise ignored).
                                          // -1 means keep current offset. Any other value sets
                                          // the hidden counter for that Appendable UAV.
    UINT,                                 // index of first UAV to set
    UINT,                                 // number of UAVs being set (all others unbound)
    UINT,                                 // the first UAV in the set of updated UAVs (including NULL bindings)
    UINT                                  // the number of UAVs in the set of updated UAVs (including NULL bindings)
    )
There is a separate CSSetUnorderedAccessViews API/DDI that accepts UnorderedAccessViews to be bound for the Compute side of the device. It is similar to the above, except it doesn't include RenderTargets.
The last two parameters, UAVRangeStart and UAVRangeSize, exist at the DDI level but not at the OMSetRenderTargets API level. The Direct3D 11 runtime tracks the set of bound UAVs which have changed (which may be different from the set of bound UAVs overall), and the driver may use this information for optimization purposes.
UAVs have the same precedence in Hazard Tracking as RTVs and SO Targets:
If a subresource is ever bound as an output (RTV/UAV/SO Target), subsequently unbound, and then bound as a shader input, a ReadAfterWriteHazard DDI is called. Drivers can use this as a hint as to when a rendering flush may be required. There are additional situations where Read After Write hazards are reported given the two pipelines (Graphics and Compute): in particular, resources moving from an output binding on one side to an input binding on the other, as well as Compute outputs moving to Compute inputs. Note UAVs are considered "outputs", since if an application only needs to read a resource, it should be bound as an input instead.
There is a significant and unfortunate limitation in many hardware designs that had to be built into D3D. While Typed UAVs support many formats – essentially any format that can be a RenderTarget - the majority of these formats only support being written as a UAV, but not read at the same time.
Shader Resource Views are of course always available in any shader stage when only read-only access from arbitrary locations in a Typed resource is needed. Conversely, it is useful that if write-only access to arbitrary locations in a Typed resource is needed, UAVs support that scenario.
However, simultaneous reading and writing to a UAV within a single Draw* or Dispatch* operation is only supported if the UAV's Type is R32_UINT/_SINT/_FLOAT. In particular, the ld_uav_typed IL instruction for reading from a typed UAV is limited to the R32_UINT/_SINT/_FLOAT formats. For example, a UAV with a type such as R8G8B8A8_UNORM_SRGB cannot be read from (but it can be written to).
D3D has a partial workaround for this inability to simultaneously read+write from Typed UAVs. The purpose is to make tasks such as editing an image in-place simpler, given the circumstances.
D3D allows Texture1D/2D/3D resources created with any of the following small set of 32-bit per element formats to have UAVs created from them with R32_UINT/_SINT/_FLOAT as the type:
Once an R32_* UAV is created, it allows arbitrary reading and writing to the UAV’s memory in-place. The catch is there is no type conversion since the format is R32_*, meaning reads and writes simply move raw data unaltered between a shader and memory. Since the desire of the application is that the memory is really interpreted as some format like DXGI_FORMAT_R8G8B8A8_UNORM_SRGB, the application is responsible for manually performing type conversion in the shader code upon reads and writes to the R32_* UAV.
The upside is that because the original resource was created with one of the _TYPELESS formats listed above, other views such as Shader Resource Views or Render Target Views can be created using the format that the application actually intended, such as DXGI_FORMAT_R8G8B8A8_UNORM_SRGB. These properly typed views then benefit from fixed-function hardware type conversion on reads and writes (e.g. during texture filtering on read, or blending on write), conversion that is not available through the UAV, where manual type conversion code has to be done in the shader.
The formats supporting this casting to R32_* are limited to those for which the hardware memory layout is genuinely identical to R32_*, excluding a few that have complex encodings, such as DXGI_FORMAT_R11G11B10_FLOAT. If this ability to cast to R32_* UAVs were not included in D3D, applications would have to perform a copy rendering pass to move data from an R32_* resource where the image editing occurred to a separate resource that has the desired type (e.g. R10G10B10A2_UNORM), which is a waste of memory.
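A hedged sketch of the casting pattern at the public API (this assumes R8G8B8A8_TYPELESS is among the castable 32-bit formats, as the R8G8B8A8_UNORM_SRGB example above implies; all names are illustrative):

ID3D11Texture2D* pTex = nullptr;
D3D11_TEXTURE2D_DESC td = {};
td.Width = 512; td.Height = 512; td.MipLevels = 1; td.ArraySize = 1;
td.Format = DXGI_FORMAT_R8G8B8A8_TYPELESS;  // 32 bits per element, castable
td.SampleDesc.Count = 1;
td.Usage = D3D11_USAGE_DEFAULT;
td.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
pDevice->CreateTexture2D( &td, nullptr, &pTex );

// R32_UINT UAV: raw 32-bit loads/stores; the shader does UNORM_SRGB conversion manually.
ID3D11UnorderedAccessView* pUAV = nullptr;
D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format = DXGI_FORMAT_R32_UINT;
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE2D;
pDevice->CreateUnorderedAccessView( pTex, &uavDesc, &pUAV );

// Properly typed SRV over the same memory: fixed-function conversion and filtering apply.
ID3D11ShaderResourceView* pSRV = nullptr;
D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM_SRGB;
srvDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
srvDesc.Texture2D.MipLevels = 1;
pDevice->CreateShaderResourceView( pTex, &srvDesc, &pSRV );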
Unordered Append Buffers enable a usage pattern whereby Pixel Shader and Compute Shaders can write structures of data to memory in variable quantity, in an unordered way. Hardware can take advantage of knowing this type of operation is going on, producing optimized performance.
For Structured Buffers that have been created with the Bind flag D3D11_DDI_BIND_UNORDERED_ACCESS, Unordered Access Views can be created with one of the optional flags D3D11_DDI_BUFFER_UAV_FLAG_COUNTER or D3D11_DDI_BUFFER_UAV_FLAG_APPEND. The latter flag trades some flexibility for (possibly) better performance, as described later.
Creating a Structured Buffer UAV with UAV_FLAG_COUNTER causes the driver to allocate storage for a single hidden 32-bit unsigned integer counter associated with the UAV (as opposed to being associated with the underlying resource), initialized to 0. Multiple UAVs created on the same Buffer with this flag will thus have multiple independent counters.
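A minimal sketch of such a creation at the public API (the Structured Buffer is assumed to have been created with BIND_UNORDERED_ACCESS and the structured misc flag; counts are illustrative):

ID3D11UnorderedAccessView* pCounterUAV = nullptr;
D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format = DXGI_FORMAT_UNKNOWN;   // required for Structured Buffer views
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.NumElements = 1024;      // in structs
uavDesc.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_COUNTER;  // or D3D11_BUFFER_UAV_FLAG_APPEND
pDevice->CreateUnorderedAccessView( pStructuredBuffer, &uavDesc, &pCounterUAV );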
Shaders can atomically increment or decrement this count (but not do both in one shader) and use the returned index to indicate which structure index in the UAV to access. If the _COUNTER flag is used, count values (representing struct index) returned to the shader may be saved for use later after the shader has completed, for example for linked lists.
If the _APPEND flag is used when creating the UAV, a counter is created like with the _COUNTER flag, except the counter values returned to a shader invocation when incrementing or decrementing the count are only valid for the lifetime of the shader invocation. So the shader can use the index during the shader invocation to access the corresponding struct index in the UAV, but the hardware is permitted to reorder the struct layout from the point of view of anything outside the shader invocation, or after the shader invocation is complete. This is for cases where an application is simply generating struct records and it does not care that the order of the records is maintained. However if the application goes out of its way to examine the buffer (such as copying from it or using some other type of View) the hardware will have to pack the records into the range of struct locations corresponding to the number of times shader invocations incremented the counter on a given UAV. Even though the data will appear packed, the structs may be reordered. Some hardware will take advantage of not having to maintain the order to provide better access performance.
When Pixel Shaders and Compute Shaders bind UAVs that have _COUNTER or _APPEND usage specified, an initial value for the View's hidden counter must be provided as part of the bind call. Specifying -1 means maintain the current counter value already in the View. Any other value sets the counter value.
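At the public API this surfaces as the pUAVInitialCounts parameter of OMSetRenderTargetsAndUnorderedAccessViews (and of CSSetUnorderedAccessViews); a hedged sketch with illustrative slot numbers:

UINT initialCounts[1] = { 0 };  // reset the hidden counter; (UINT)-1 would keep it
pContext->OMSetRenderTargetsAndUnorderedAccessViews(
    1, &pRTV, pDSV,
    1,                 // UAVStartSlot: first u# after the RTV range
    1, &pAppendUAV,
    initialCounts );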
When an Append UAV is bound to the pipeline, the instructions that can access it are restricted to the following:
imm_atomic_alloc(22.17.17)

For an Append UAV, the HLSL compiler can use imm_atomic_alloc to obtain an "address" and then use a sequence of store_* commands to write out data to a unique location in the unordered output to the UAV.
Conversely, the HLSL compiler can use imm_atomic_consume to obtain an "address" that already has data and then use a sequence of ld_* commands to read back data from a unique location in the UAV.
For Append UAVs, the count values returned by imm_atomic_alloc and imm_atomic_consume are hidden from the shader by the HLSL compiler, which exposes simply the ability to Append() structs or Consume() structs (not both in the same shader).
For Count UAVs, where the returned count value may be stored, any instructions capable of accessing Structured Buffers are permitted from the shader, in addition to all of the instructions listed above. Unlike Append UAVs, the HLSL compiler exposes the count values returned by imm_atomic_alloc and imm_atomic_consume for access in the shader – allowing the value to be saved.
The counter behind imm_atomic_alloc and imm_atomic_consume has no overflow or underflow clamping, and there is no feedback given to the shader as to whether overflow/underflow happened (wrapping of the counter). The only thing the counter really accomplishes is a way of generating unique addresses that is conveniently bundled with the UAV.
It is invalid for a single shader, or multiple shaders in flight on a GPU, to use both imm_atomic_alloc and imm_atomic_consume instructions on the same UAV. For a single shader, compilation fails if these operations (however they appear in HLSL) are mixed. The GPU must guarantee that Shader invocations from separate Draw*/Dispatch operations do not run out of sequence when there is a possibility that an alloc/consume hazard could exist.
The counter associated with a Count/Append UAV is somewhat like the counters that are associated with Stream Output buffers (note a Buffer cannot be both a Stream Output and Count/Append Buffer), although those counters have slightly different semantics. There is an API/DDI CopyStructureCount which allows the hidden count in a Count/Append UAV to be copied to another Buffer. This can serve as the vertex count parameter to Draw*InstancedIndirect, allowing data that has been written to an Append Buffer to be recirculated back into the GPU without CPU knowledge of the exact quantity involved.
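A hedged sketch of that recirculation (the 16-byte args Buffer is assumed to have been created with the DRAWINDIRECT_ARGS misc flag and pre-initialized so InstanceCount is 1; names and offsets are illustrative):

// Args layout: { VertexCountPerInstance, InstanceCount, StartVertexLocation, StartInstanceLocation }
// Write the Append UAV's hidden count into VertexCountPerInstance (byte offset 0).
pContext->CopyStructureCount( pArgsBuffer, 0, pAppendUAV );
pContext->DrawInstancedIndirect( pArgsBuffer, 0 );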
When Append/Count UAVs are bound to the pipeline the application can specify what the initial counter value should be, or choose to maintain the existing count value.
For an Append UAV, since the storage is unordered, when binding the UAV to the pipeline as a UAV or any other type of view (e.g. SRV), the contents of any struct entries in the UAV beyond the count value become undefined, and any contents within the count value are maintained, but may be reordered. It is fine for multiple different types of UAVs to overlap, but the application has to beware of the effect that the unordered nature of Append UAVs may have (when bound/used) on other overlapping views of the same memory. It is safest for an application not to mix usage of overlapping UAVs with expectations of data order being maintained in between.
Count UAVs do not create any such ordering issues, since by definition applications are allowed to save count values as references to specific locations in the UAV.
For some implementations, Append UAVs will behave identically to Count UAVs (e.g. no reordering). Still, if the application does not care about the ordering of records being maintained in the UAV, it does not hurt (and can only help on some implementations) to make use of the constrained Append semantics for generating and subsequently consuming unordered collections of items.
As of the D3D11.1 API/DDI, Video Resources can have SRV/RTV/UAVs created so that D3D shaders can process them. The way the underlying Video Resource shows up in D3D as an ID3D11Resource* is described in separate D3D11 Video specs. This section covers how given an ID3D11Resource* to a Video Resource, SRV/RTV/UAVs can be created in D3D.
These Video Resources will be either Texture2D or Texture2DArray, so the ViewDimension in the VIEW_DESC structure must match. Additionally, the format of the underlying Video Resource restricts the formats that the View can use.
The following table describes all the combinations of Video Resource and View(s) that can be made from them. Note that multiple views of different parts of the same surface can be created, and depending on the format they may have different sizes from each other. A few video formats do not support D3D SRV/UAV/RTVs at all: DXGI_FORMAT_420_OPAQUE, _AI44, _IA44, _P8 and _A8P8. Further details on all the video formats are provided in the D3D11 Video DDI spec.
Runtime read+write conflict prevention logic (which stops a resource from being bound as an SRV and RTV/UAV at the same time) treats Views of different parts of the same Video surface as conflicting for simplicity. It doesn’t seem interesting to allow the case of reading from luma while simultaneously rendering to chroma in the same surface, for example, even though it may be possible in hardware.
Video Resource Format (DXGI_FORMAT_*) | Valid View Format (DXGI_FORMAT_*) | Meaning | Mapping to View Channel | View Types Supported
---|---|---|---|---
AYUV (This is the most common YUV 4:4:4 format) | R8G8B8A8_{UNORM|UINT}, or for UAVs, an additional choice: R32_UINT | Straightforward mapping of the entire surface in one view. Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other format) | V->R8, U->G8, Y->B8, A->A8 | SRV, RTV, UAV |
YUY2 (This is the most common YUV 4:2:2 format) | R8G8B8A8_{UNORM|UINT}, or for UAVs, an additional choice: R32_UINT | Straightforward mapping of the entire surface in one view. Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other format) | Y0->R8, U0->G8, Y1->B8, V0->A8 | SRV, UAV |
 | R8G8_B8G8_UNORM | In this case the width of the view will appear to be twice what an R8G8B8A8 view would be, with hardware reconstruction of RGBA done automatically on read (and before filtering). This has been in D3D hardware for a long time (legacy), though it likely is not interesting any more. | Y0->R8, U0->G8[0], Y1->B8, V0->G8[1] | SRV |
NV12 (This is the most common YUV 4:2:0 format) | R8_{UNORM|UINT} | Luminance Data View | Y->R8 | SRV, RTV, UAV |
 | R8G8_{UNORM|UINT} | Chrominance Data View (width and height are each 1/2 of luminance view) | U->R8, V->G8 | SRV, RTV, UAV |
NV11 (This is the most common YUV 4:1:1 format) | R8_{UNORM|UINT} | Luminance Data View | Y->R8 | SRV, RTV, UAV |
 | R8G8_{UNORM|UINT} | Chrominance Data View (width and height are each 1/4 of luminance view) | U->R8, V->G8 | SRV, RTV, UAV |
P016 (This is a 16 bit per channel planar 4:2:0 format) | R16_{UNORM|UINT} | Luminance Data View | Y->R16 | SRV, RTV, UAV |
 | R16G16_{UNORM|UINT}, or for UAVs, an additional choice: R32_UINT | Chrominance Data View (width and height are each 1/2 of luminance view). Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other format) | U->R16, V->G16 | SRV, RTV, UAV |
P010 (This is a 10 bit per channel planar 4:2:0 format) | R16_{UNORM|UINT} | Luminance Data View. D3D does not enforce or care whether or not the lowest 6 bits are 0 (given this is a 10 bit format using 16 bits) – application shader code would have to enforce this manually if desired. From the D3D point of view, this format is no different than P016. | Y->R16 | SRV, RTV, UAV |
 | R16G16_{UNORM|UINT}, or for UAVs, an additional choice: R32_UINT | Chrominance Data View (width and height are each 1/2 of luminance view). Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other format). Same comment as above about this 10 bit format using 16 bits. | U->R16, V->G16 | SRV, RTV, UAV |
Y216 (This is a 16 bit per channel packed 4:2:2 format) | R16G16B16A16_{UNORM|UINT} | Straightforward mapping of the entire surface in one view. | Y0->R16, U->G16, Y1->B16, V->A16 | SRV, UAV |
Y210 (This is a 10 bit per channel packed 4:2:2 format) | R16G16B16A16_{UNORM|UINT} | Straightforward mapping of the entire surface in one view. D3D does not enforce or care whether or not the lowest 6 bits are 0 (given this is a 10 bit format using 16 bits) – application shader code would have to enforce this manually if desired. From the D3D point of view, this format is no different than Y216. | Y0->R16, U->G16, Y1->B16, V->A16 | SRV, UAV |
Y416 (This is a 16 bit per channel packed 4:4:4 format) | R16G16B16A16_{UNORM|UINT} | Straightforward mapping of the entire surface in one view. | U->R16, Y->G16, V->B16, A->A16 | SRV, UAV |
Y410 (This is a 10 bit per channel packed 4:4:4 format) | R10G10B10A2_{UNORM|UINT}, or for UAVs, an additional choice: R32_UINT | Straightforward mapping of the entire surface in one view. Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other format). | U->R10, Y->G10, V->B10, A->A2 | SRV, UAV |
Resources have the following properties in common, specified at Resource creation:
Resources are made up of one or more Subresources. These Subresources share a common lifespan with each other and the Resource. In other words, the Resource and Subresources are atomically allocated and destroyed. However, some operations occur at the Subresource level, versus the Resource level. Subresources are three dimensional entities (with height, width, depth, pitch, and slice pitch), but degenerate into two and one dimensional entities for certain Resources. For example, a fully mipped Texture2D Resource created with a width of two, a height of two, and an array size of two will have four Subresources that can be individually referenced for certain operations. Two Subresources have a width of two, height of two, and depth of one; these two Subresources are the most detailed mip level. The other two Subresources have a width of one, height of one, and depth of one. Each Subresource is allowed to have its own address, so the Resource may have somewhere between one and four disjoint allocations to satisfy the previous example. Each Subresource inherits the properties of the Resource, and Subresources may not be part of multiple Resources.
typedef enum D3D10DDIRESOURCE_TYPE
{
    D3D10DDIRESOURCE_BUFFER      = 1,
    D3D10DDIRESOURCE_TEXTURE1D   = 2,
    D3D10DDIRESOURCE_TEXTURE2D   = 3,
    D3D10DDIRESOURCE_TEXTURE3D   = 4,
    D3D10DDIRESOURCE_TEXTURECUBE = 5,
#if D3D11DDI_MINOR_HEADER_VERSION >= 1
    D3D11DDIRESOURCE_BUFFEREX    = 6,
#endif
} D3D10DDIRESOURCE_TYPE;

typedef struct D3D10DDI_MIPINFO
{
    UINT TexelWidth;
    UINT TexelHeight;
    UINT TexelDepth;
    UINT PhysicalWidth;
    UINT PhysicalHeight;
    UINT PhysicalDepth;
} D3D10DDI_MIPINFO;

typedef struct D3D10_DDIARG_SUBRESOURCE_UP
{
    VOID* pSysMem;
    UINT SysMemPitch;
    UINT SysMemSlicePitch;
} D3D10_DDIARG_SUBRESOURCE_UP;

typedef struct D3D11DDI_HRESOURCE
{
    void* m_pDrvPrivate;
} D3D11DDI_HRESOURCE;

// Bits for D3D11DDIARG_CREATERESOURCE::BindFlags
typedef enum D3D10_DDI_RESOURCE_BIND_FLAG
{
    D3D10_DDI_BIND_VERTEX_BUFFER    = 0x00000001L,
    D3D10_DDI_BIND_INDEX_BUFFER     = 0x00000002L,
    D3D10_DDI_BIND_CONSTANT_BUFFER  = 0x00000004L,
    D3D10_DDI_BIND_SHADER_RESOURCE  = 0x00000008L,
    D3D10_DDI_BIND_STREAM_OUTPUT    = 0x00000010L,
    D3D10_DDI_BIND_RENDER_TARGET    = 0x00000020L,
    D3D10_DDI_BIND_DEPTH_STENCIL    = 0x00000040L,
    D3D10_DDI_BIND_PIPELINE_MASK    = 0x0000007FL,
    D3D10_DDI_BIND_PRESENT          = 0x00000080L,
    D3D10_DDI_BIND_MASK             = 0x000000FFL,
#if D3D11DDI_MINOR_HEADER_VERSION >= 1
    D3D11_DDI_BIND_UNORDERED_ACCESS = 0x00000100L,
    D3D11_DDI_BIND_PIPELINE_MASK    = 0x0000017FL,
    D3D11_DDI_BIND_MASK             = 0x000001FFL,
#endif
} D3D10_DDI_RESOURCE_BIND_FLAG;

// Bits for D3D11DDIARG_CREATERESOURCE::MapFlags
typedef enum D3D10_DDI_CPU_ACCESS
{
    D3D10_DDI_CPU_ACCESS_WRITE = 0x00000001L,
    D3D10_DDI_CPU_ACCESS_READ  = 0x00000002L,
    D3D10_DDI_CPU_ACCESS_MASK  = 0x00000003L,
} D3D10_DDI_CPU_ACCESS;

// Bits for D3D11DDIARG_CREATERESOURCE::Usage
typedef enum D3D10_DDI_RESOURCE_USAGE
{
    D3D10_DDI_USAGE_DEFAULT   = 0,
    D3D10_DDI_USAGE_IMMUTABLE = 1,
    D3D10_DDI_USAGE_DYNAMIC   = 2,
    D3D10_DDI_USAGE_STAGING   = 3,
} D3D10_DDI_RESOURCE_USAGE;

// Bits for D3D11DDIARG_CREATERESOURCE::MiscFlags
typedef enum D3D10_DDI_RESOURCE_MISC_FLAG
{
    D3D10_DDI_RESOURCE_AUTO_GEN_MIP_MAP            = 0x00000001L,
    D3D10_DDI_RESOURCE_MISC_SHARED                 = 0x00000002L,
    // Reserved for D3D11_RESOURCE_MISC_TEXTURECUBE   0x00000004L,
    D3D10_DDI_RESOURCE_MISC_DISCARD_ON_PRESENT     = 0x00000008L,
#if D3D11DDI_MINOR_HEADER_VERSION >= 1
    D3D11_DDI_RESOURCE_MISC_DRAWINDIRECT_ARGS      = 0x00000010L,
    D3D11_DDI_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS = 0x00000020L,
    D3D11_DDI_RESOURCE_MISC_BUFFER_STRUCTURED      = 0x00000040L,
    D3D11_DDI_RESOURCE_MISC_RESOURCE_CLAMP         = 0x00000080L,
#endif
    // Reserved for D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX 0x00000100L,
    // Reserved for D3D11_RESOURCE_MISC_GDI_COMPATIBLE    0x00000200L,
    D3D10_DDI_RESOURCE_MISC_REMOTE                 = 0x00000400L,
} D3D10_DDI_RESOURCE_MISC_FLAG;

typedef struct D3D11DDIARG_CREATERESOURCE
{
    CONST D3D10DDI_MIPINFO* pMipInfoList;
    CONST D3D10_DDIARG_SUBRESOURCE_UP* pInitialDataUP;  // non-NULL if Usage has invariant
    D3D10DDIRESOURCE_TYPE ResourceDimension;            // Part of old Caps1
    UINT Usage;                                         // Part of old Caps1
    UINT BindFlags;                                     // Part of old Caps1
    UINT MapFlags;
    UINT MiscFlags;
    DXGI_FORMAT Format;                                 // Totally different than D3DDDIFORMAT
    DXGI_SAMPLE_DESC SampleDesc;
    UINT MipLevels;
    UINT ArraySize;
    // Can only be non-NULL if BindFlags has D3D10_DDI_BIND_PRESENT bit set; but not always.
    // Presence of structure is an indication that Resource could be used as a primary (ie. scanned-out),
    // and naturally used with Present (flip style). (UMD can prevent this - see dxgiddi.h)
    // If pPrimaryDesc absent, blt/ copy style is implied when used with Present.
    DXGI_DDI_PRIMARY_DESC* pPrimaryDesc;
    UINT ByteStride;  // 'StructureByteStride' at API
} D3D11DDIARG_CREATERESOURCE;

// part of user mode Device interface:
STDMETHOD_( SIZE_T, CalcPrivateResourceSize )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_CREATERESOURCE* pCreateResource );
STDMETHOD( CreateResource )(
    D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_CREATERESOURCE* pCreateResource,
    D3D11DDI_HRESOURCE hDrvResource );
STDMETHOD_( void, DestroyResource )(
    D3D10DDI_HDEVICE hDrvDevice,
    D3D11DDI_HRESOURCE hDrvResource );
A structured buffer(5.1.3) is created by specifying both a new misc flag and the stride of the structure.
The only D3D11 Resource type that can have a structure defined is the Buffer type. When the Resource is created at the API, the misc flag D3D11_RESOURCE_MISC_BUFFER_STRUCTURED and a structure stride in bytes must be specified.
The StructureByteStride can be at most 2048 bytes.
The D3D11_RESOURCE_MISC_BUFFER_STRUCTURED flag cannot be combined with D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS (described elsewhere).
The D3D11_RESOURCE_MISC_BUFFER_STRUCTURED flag may be combined with any of the following bind flags:
The D3D11_RESOURCE_MISC_BUFFER_STRUCTURED flag may NOT be combined with any of the following bind flags:
Buffers that define a structure cannot be used with the InputAssembler, either for vertex or index data. Structured buffers also cannot be bound as a stream output target or render target.
If D3D11_RESOURCE_MISC_BUFFER_STRUCTURED is not set, then the StructureByteStride parameter to the Buffer creation must be 0; otherwise the runtime will fail the creation call.
If D3D11_RESOURCE_MISC_BUFFER_STRUCTURED is set, then StructureByteStride must be non-zero and ByteWidth must be evenly divisible by StructureByteStride. If either condition is not met when creating a structured buffer, the runtime will fail the create call.
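Putting the rules above together, a minimal creation sketch at the public API (the struct layout and element count are illustrative):

struct Particle { float pos[3]; float vel[3]; };  // 24-byte struct, well under the 2048 limit

ID3D11Buffer* pStructuredBuffer = nullptr;
D3D11_BUFFER_DESC bd = {};
bd.ByteWidth = 1024 * sizeof(Particle);     // evenly divisible by the stride
bd.Usage = D3D11_USAGE_DEFAULT;
bd.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
bd.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bd.StructureByteStride = sizeof(Particle);  // non-zero when the misc flag is set
pDevice->CreateBuffer( &bd, nullptr, &pStructuredBuffer );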
Resource size dimensions (Width, Height, Depth) are always specified in pixel units. Size dimensions are restricted only for subsampled and block compressed formats (see Formats(19.1) section), and are otherwise restricted only to positive integers. Furthermore, the size dimensions of a Resource have no bearing on what functionality is available for the resource (such as filtering support).
Resource pitches are always expressed in bytes, and indicate the memory delta between the start of pixel rows or array slices, with the only exception being block compressed formats, where the pitch is defined as between 'block' rows instead of pixel rows. Pitch values are restricted only to non-negative integers, intentionally including zero, for which the first row will be replicated to all rows.
Size dimensions for lower level mipmapped resources are computed by the Direct3D runtime based on the size of the level zero map. These computed dimensions are adjusted upward as necessary to adhere to physical size dimension restrictions for subsampled and block compressed formats - refer to the discussion of physical and virtual dimensions in Block Compressed Formats(19.5) and Sub-Sampled Formats(19.4).
Section Contents
(back to chapter)
5.6.1 Mapping
Mapping/ locking is done at the Subresource level, instead of the Resource level. Mapping means granting CPU access to the Subresource's storage or contents. Typically, the user mode driver must invoke the Lock callback to achieve this operation. The application subsequently relinquishes direct access to mapped Subresources by unmapping them. Only one Map for a given Subresource is allowed (even for non-overlapping regions) and no accelerator operations on a Subresource may be ongoing while a Map is outstanding on that Subresource. However, multiple Subresources of the same Resource may be Mapped at the same time. Each Map method returns a structure that contains a pointer to the storage backing the Resource, and pitch values representing the distances between rows or planes of data, depending on the Subresource dimensionality. The returned pointer always points to the top-left byte (U = 0, V = 0, W = 0) of the mapped Subresource. The layout is similar to that of a multidimensional 'C' array, where the Subresource can be considered to be the following 'C' declaration:
Pixel_Type Subresource [ W ][ V ][ U ];
with the additional characteristic that the driver is allowed to specify the byte pitch between each row (or block-row for BC formats) and each depth slice.
When returning a pointer to the mapped resource, the pointer must be 16-byte aligned. This restriction allows applications to perform SSE-optimized operations on the data natively, without realignment or copy (example usages include CPU geometry and texture processing).
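To illustrate the layout, a hypothetical helper (not part of the DDI) that computes the address of an element at coordinates (u,v,w) from the pointer and pitches returned by Map (the D3D11DDIARG_MAPOUT structure shown below); BytesPerElement is the element size of the Subresource's format:

// Address of element (u,v,w) within a mapped Subresource.
// For BC formats, 'v' indexes block rows rather than pixel rows.
BYTE* ElementAddress( const D3D11DDIARG_MAPOUT* pMapOut,
                      UINT u, UINT v, UINT w, UINT BytesPerElement )
{
    return (BYTE*)pMapOut->pSurfData
         + w * pMapOut->SlicePitch
         + v * pMapOut->Pitch
         + u * BytesPerElement;
}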
// D3D11.3 Mapping/ Locking:
// One, more, or none: CPUREAD, CPUWRITE
// Exclusively one or none: RANGEVALID, AREAVALID, BOXVALID
// Exclusively one or none: DISCARDRESOURCE
// Bits for D3D11DDIARG_MAPIN::Flags
#define D3D11DDILOCK_CPUREAD
#define D3D11DDILOCK_CPUWRITE
#define D3D11DDILOCK_RANGEVALID
#define D3D11DDILOCK_AREAVALID
#define D3D11DDILOCK_BOXVALID
#define D3D11DDILOCK_DISCARDRESOURCE
#define D3D11DDILOCK_NOOVERWRITE

typedef struct D3D11DDIARG_MAPIN
{
    D3D11DDI_HRESOURCE hResource; // in: resource identifier
    UINT32 Subresource;           // in: zero based subresource index
    UINT32 Flags;                 // in: flags
} D3D11DDIARG_MAPIN;

typedef struct D3D11DDIARG_MAPOUT
{
    void* pSurfData;   // out: pointer to memory
    SIZE_T Pitch;      // out: pitch of memory
    SIZE_T SlicePitch; // out: slice pitch of memory
} D3D11DDIARG_MAPOUT;

typedef struct D3D11DDIARG_UNMAPIN
{
    D3D11DDI_HRESOURCE hResource; // in: resource identifier
    UINT32 Subresource;           // in: zero based subresource index
} D3D11DDIARG_UNMAPIN;

// part of user mode Device interface:
STDMETHOD( Map )( D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_MAPIN* pMapIn, D3D11DDIARG_MAPOUT* pMapOut ) = 0;
STDMETHOD( Unmap )( D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_UNMAPIN* pUnmapIn ) = 0;
Map() allows NO_OVERWRITE for Buffers with DYNAMIC usage and the SHADER_RESOURCE (shader input) bind flag. Before D3D11.1 this was disallowed (though DISCARD was allowed).
Before the first call with NO_OVERWRITE on a deferred context, a DISCARD must be done on the same context (via Copy*()/Update*()/Map() API flag or Discard*() API). This is not required on immediate contexts if the application knows the GPU is finished with the resource (though discard can be used if not).
This feature is required to be supported for all D3D10+ hardware with D3D11.1 drivers.
The background here is that Map() NO_OVERWRITE used to be allowed on Dynamic Index Buffers or Vertex Buffers. Game developers would use this to perform a sliding window of successive buffer updates while rendering follows along. The driver would not have to rename the surface and the GPU did not have to flush rendering while it referenced the Buffer even as the application updated other parts of it.
Increasingly developers have found reasons to pass the same sort of data into shaders directly (via Shader Resource View) to take advantage of the extra flexibility versus the fixed function semantics of Vertex and Index Buffers at the Input Assembler. As of D3D10, Map() NO_OVERWRITE was not allowed on DYNAMIC Buffers with the Shader Resource bind flag, however. This was simply an oversight, hindering the ability to efficiently feed vertex/index style data directly to shaders.
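A minimal sketch of the sliding-window pattern at the API, here on a DYNAMIC Buffer with the SHADER_RESOURCE bind flag (names such as pCtx, offset and chunkSize are illustrative):

// Append 'chunkSize' bytes without stalling the GPU: DISCARD on the first
// write (or when wrapping), NO_OVERWRITE thereafter.
D3D11_MAPPED_SUBRESOURCE mapped;
D3D11_MAP mapType = D3D11_MAP_WRITE_NO_OVERWRITE;
if (offset + chunkSize > bufferSize)   // wrapped: orphan the old contents
{
    offset = 0;
    mapType = D3D11_MAP_WRITE_DISCARD;
}
pCtx->Map(pDynamicBuffer, 0, mapType, 0, &mapped);
memcpy((BYTE*)mapped.pData + offset, pSrcData, chunkSize);
pCtx->Unmap(pDynamicBuffer, 0);
offset += chunkSize;                   // GPU reads of earlier regions proceed undisturbed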
Map() can be called on Buffers with DEFAULT usage and SHADER_RESOURCE and/or UNORDERED_ACCESS bind flags.
The Buffer can have MiscFlags BUFFER_ALLOW_RAW_VIEWS, BUFFER_STRUCTURED or nothing.
Before D3D11.2 this was disallowed. As of D3D11.2, this feature is required to be supported for Feature Level 11.0+ devices with WDDM1.3+ drivers.
The goal here was to reduce the number of copies required to transfer Buffer data to and from the GPU. Previously, to allow CPU access of the data generated in a DirectCompute computation, an app had to perform an intermediate copy to a STAGING resource. This was due to the fact that only STAGING resources could be directly accessed by the CPU. The need for this copy resulted in a measurable performance hit on bandwidth-intensive DirectCompute scenarios.
This feature exposed the ability to create Default buffers marked with D3D11_CPU_ACCESS_FLAGs, as long as their creation description matched the specific configuration options described. These restrictions were designed merely to scope down the investigation and development work to fit within budget while enabling the core scenario, not because hardware necessarily has the same degree of constraint.
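A sketch of one such CPU-accessible DEFAULT Buffer configuration (illustrative names; assumes a Feature Level 11.0+ device with a WDDM1.3+ driver, per the requirement above):

D3D11_BUFFER_DESC desc = {};
desc.ByteWidth      = 1024 * 1024;
desc.Usage          = D3D11_USAGE_DEFAULT;                 // DEFAULT, not STAGING
desc.BindFlags      = D3D11_BIND_UNORDERED_ACCESS;         // and/or SHADER_RESOURCE
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ | D3D11_CPU_ACCESS_WRITE;
ID3D11Buffer* pDefaultBuf = nullptr;
HRESULT hr = pDevice->CreateBuffer(&desc, nullptr, &pDefaultBuf);
// On success, Map() can then be called on pDefaultBuf directly, with no
// intermediate STAGING copy.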
This function allows sub-region copying of data from one Subresource to another. No stretch, color key, blend, nor format conversion. However, format types of each Subresource need not be exactly equal to each other, as the Resource may be Prestructured+Typeless Memory(5.1.5), which is also supported. For example, a R32_FLOAT Texture can be copied to an R32_UINT Texture, as both of these formats are in the same R32_TYPELESS group. Conceptually, the interpreted value of texels changes during this type of copy; but the raw value of memory happens to be equal. This function also works when both Subresources are Unstructured Memory(5.1.2), except that the regions to copy are specified in raw bytes, versus pixel or Element units.
In addition, the Subresources need not be of equal size; but the source and destination regions must fit entirely within the Subresources. The source and destination Subresources must not be the same Subresources.
Resources which can be used as Depth/ Stencil cannot participate in this operation as a destination; but they can as a source. Multisampled Resources cannot participate in Copy operations.
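At the API level this might look like the following sketch (copying a 256x256 region from an R32_FLOAT texture into an R32_UINT texture; names are illustrative):

D3D11_BOX srcBox = { 0, 0, 0, 256, 256, 1 }; // left, top, front, right, bottom, back
pCtx->CopySubresourceRegion(
    pDstTexR32Uint, 0,    // dest resource, subresource
    0, 0, 0,              // dest offset (DstX, DstY, DstZ)
    pSrcTexR32Float, 0,   // source resource, subresource
    &srcBox );            // source region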
typedef struct D3D11DDIARG_COPYSUBRESOURCEREGIONIN
{
    D3D11DDI_HRESOURCE hDstResource; // in: resource identifier
    UINT32 DstSubresource;           // in: zero based subresource index
    POINT3D DstPoints;               // in: Destination Offset
    D3D11DDI_HRESOURCE hSrcResource; // in: resource identifier
    UINT32 SrcSubresource;           // in: zero based subresource index
    CONST D3D11_BOX* SrcBox;         // in: Source Region
} D3D11DDIARG_COPYSUBRESOURCEREGIONIN;

// part of user mode Device interface:
STDMETHOD( CopySubresourceRegion )( D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_COPYSUBRESOURCEREGIONIN* pCopySubresourceRegionIn ) = 0;
CopySubresourceRegion*() allows the source and dest to be the same resource, with D3D11.1 drivers. The driver must handle overlapping copies.
This feature is required to be supported for all D3D10+ hardware with D3D11.1 runtime+drivers. When the application uses feature level 9.x all drivers support this with the D3D11.1 runtime.
CopySubresourceRegion*() allows a new TILEABLE flag when the source is a currently bound RenderTarget (flag ignored otherwise). This is intended for tile / deferred rendering GPUs (no impact on the copy for non-tiled rendering GPUs). The flag indicates that if the GPU happens to be processing only a given tile of a RenderTarget at a time (where the RenderTarget is the source in the copy), the GPU can break the copy call to occur per-tile along with the surrounding rendering calls batched for the scene, without having to flush the scene for all tiles.
The application is guaranteeing that future access to the destination of the copy will only be used for 1:1 cycling of that data back into the same pixel location of the affected RenderTarget (which remains bound). Said another way, the application is guaranteeing that when a tiling GPU replays batched rendering commands to produce any given tile, there will be no visible effect (e.g. to commands earlier in the batch) of the copy having already occurred for previously processed tiles.
The source and dest don't have to be the same size resource; this flag is relevant to just the region being copied.
When the application is finished using the target of the TILEABLE copy for recirculating back to the original surface, DiscardResource() should be called if the contents are no longer needed (but this is not strictly required). For some implementations, knowing the end of life of the data in the scratch surface could allow the entire copy to be optimized away into leaving the data in fast tile memory and never having to write it out to GPU memory.
If an application violates the 1:1 property when using the TILEABLE flag on CopySubresourceRegion, such as reading the data into a different pixel, or into a shader stage other than the Pixel Shader in the second pass, the data being read is undefined (it will have been generated by an unknown rendering pass by the application, or be uninitialized).
If the RenderTarget gets unbound, any copies from it that happened with the TILEABLE flag while bound lose the TILEABLE property after the RenderTarget unbinding.
This feature is available for all D3D9+ hardware with D3D11.1 drivers (D3D9 portion of the DDI for D3D9 hardware and both D3D9 and D3D11.1 portions of the DDI for D3D10+ hardware).
This feature will be exposed only to customers of Direct3D within the Windows OS, at least initially, given the narrowly focused application.
An example of a valid scenario (Direct2D will do something similar to this, and likely other Windows components):
The example does not work if additional copies are inserted from surface to surface (the length of the cycle can't be extended) - doing so just means the TILEABLE flag loses its value and the GPU will likely have to flush the scene. Behavior should be correct here but performance gains may be lost. In general just because the TILEABLE flag is used on a Copy doesn't mean there will not be a mid-scene flush - that could happen for other reasons, typically changing of RenderTargets. The tileable flag just means there is one less trigger for mid-scene flushes.
This function allows copying of an entire Resource, assuming the Resources are identical types and dimensions. No stretch, color key, blend, nor format conversion. However, format types of each Subresource need not be exactly equal to each other, as the Resource may be Prestructured+Typeless Memory(5.1.5), which is also supported. For example, a R32_FLOAT Texture can be copied to an R32_UINT Texture, as both of these formats are in the same R32_TYPELESS group. Conceptually, the interpreted value of texels changes during this type of copy; but the raw value of memory happens to be equal. This function also works when both Resources are Unstructured Memory(5.1.2).
Resources which can be used as Depth/ Stencil cannot participate in this operation as a destination; but they can as a source. Multisampled Resources cannot participate in Copy operations. This operation also figures heavily in performant readback and upload scenarios.(5.3.2)
typedef struct D3D11DDIARG_COPYRESOURCEIN
{
    D3D11DDI_HRESOURCE hDstResource; // in: resource identifier
    D3D11DDI_HRESOURCE hSrcResource; // in: resource identifier
} D3D11DDIARG_COPYRESOURCEIN;

// part of user mode Device interface:
STDMETHOD( CopyResource )( D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_COPYRESOURCEIN* pCopyResourceIn ) = 0;
On the ARM CPU, cache coherency isn’t provided when the GPU writes to system memory, so a GPU driver would normally be tempted to put a staging (D3D CPU memory) surface in uncached memory (which is slow for CPU access) to avoid incorrect values being read from the cache. However, the Win8 Video Memory Manager will manually flush the CPU cache on ARM when data has been copied from the GPU to a staging surface – so GPU drivers can safely use cacheable memory for STAGING surfaces (yielding good performance on CPU reads). VidMM will also flush CPU caches for the opposite case as well - before the GPU reads from a STAGING surface.
At the D3D11.1 DDI, when a STAGING surface is created, the CPU_ACCESS flags (READ and/or WRITE) are mapped directly down through the DDI, so it is obvious to drivers when the cacheable memory choice should be made (when WRITE is not set). For the D3D9 DDI (which all drivers for all hardware feature levels must implement), the mapping from D3D11's CPU_ACCESS flags to the D3D9 DDI's is described in the separate API/DDI spec - see PFND3DDDI_CREATERESOURCE - the relevant case is SYSTEMMEMORY surfaces that don't have the WriteOnly flag set at the D3D9 DDI.
A note for User Mode drivers: The driver must not cache Map on surfaces that rely on the software enforced coherency described above (i.e. surface is cacheable but mapped into an aperture segment which doesn’t support CacheCoherency). The driver must explicitly call LockCb and UnlockCb at every Map for such surfaces to give an opportunity to VidMm to apply the proper memory barrier. Failing to do so will result in the surface getting corrupted over time.
CopyResource and CopySubresourceRegion allow either or both the source and destination to be structured buffers. It is possible to copy from linear to structured, structured to linear, and structured to structured. If copying between structured buffers, the strides must be the same or the runtime will fail the copy operation. If the region to copy is not specified as complete structures, then the runtime will fail the copy operation.
When either the source or destination is linear and the other is structured, it is up to the driver to rearrange the layout if necessary. If structured buffers are stored linearly, the copy operation is a straightforward copy. If not stored linearly, any tiling or other reorganization must occur as part of the copy operation.
Only multisample render targets are able to be resolved to a single-sampled resource. Naturally, the source must be a multisampled render target, while the destination must be a single-sampled resource restricted such that it resides in video memory. For example, the destination cannot be a dynamic or system-memory friendly Resource. Thus the destination Resource must be USAGE_DEFAULT. The algorithm to resolve multiple samples to one pixel is implementation dependent. Resolve shares some of the restrictions of Copy, such as both Resources must be the same type (ie. Texture2D), and no stretching. Only a whole Subresource can be resolved, so both Subresources must be the same dimensions. Format conversion is not desired for ResolveSubresource either. However, due to typeless Resources, there is an interesting interaction with either Resource Format. If each Resource is prestructured+typed, then both Resources must have the same Format; and that must match the passed in ResolveFormat (ie. all R32_FLOAT). If one Resource is prestructured+typeless, then the prestructured+typed Resource's format must be compatible with the typeless format; and the ResolveFormat must match the prestructured+typed format (ie. Src: R32_TYPELESS, Dst & ResolveFormat: R32_FLOAT). If both Resources are prestructured+typeless, then they must be equal formats, and the ResolveFormat may be any format compatible with the typeless format and supporting resolve (ie. Src & Dst: R32_TYPELESS -> ResolveFormat must be R32_FLOAT).
Further discussion on format interpretations and Multisample Resolve can be found in the Multisample Format Support(19.2) section.
Multisample resolve is performed in linear space, so conversion to linear for sRGB formats is performed prior to any arithmetic operations on the resource data, similar to the requirement for conversion to linear prior to filtering and blending arithmetic operations.
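A sketch of the corresponding API call (resolving subresource 0 of a multisampled render target into a single-sampled USAGE_DEFAULT texture; names are illustrative):

pCtx->ResolveSubresource(
    pSingleSampledTex, 0,            // dest: single-sampled, USAGE_DEFAULT
    pMsaaRenderTarget, 0,            // source: multisampled render target
    DXGI_FORMAT_R8G8B8A8_UNORM );    // ResolveFormat, per the rules above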
typedef struct D3D11DDIARG_RESOLVESUBRESOURCEIN
{
    D3D11DDI_HRESOURCE hDstResource; // in: resource identifier
    UINT DstSubresource;             // in: subresource index
    D3D11DDI_HRESOURCE hSrcResource; // in: resource identifier
    UINT SrcSubresource;             // in: subresource index
    DXGI_FORMAT ResolveFormat;       // in: resolve format
} D3D11DDIARG_RESOLVESUBRESOURCEIN;

// part of user mode Device interface:
STDMETHOD( ResolveSubresource )( D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_RESOLVESUBRESOURCEIN* pResolveSubresourceIn ) = 0;
This operation identifies a Read-after-Write Hazard on a Resource granularity throughout the usage of a Device Context. This operation will be sent to the driver immediately before the Resource is used as an input in the graphics pipeline, as this is when the hazard is detected. For example, as a Render Target/ Texture transitions from a Render Target to a Texture, FlushResource will identify this transition immediately before the Resource is set as a Texture. FlushResource will identify the Resource, as a whole, and not the individual Subresources involved. It is expected that this operation detects when GPU caches need to be flushed.
When the pipeline is configured so that some Subresources of a Resource are being written while different, non-overlapping Subresources of the same Resource are simultaneously being read, FlushResource operations will not be sent for that Resource. The driver should not rely on notifications for this type of condition, as there is no true Read-after-Write Hazard in that case.
Additionally, FlushResource should not be expected to be used to identify any hazards related to shared Resources, whether same-process cross-Device Context Resources or cross-process Resources. Whenever a Device Context is swapped for another Device Context, GPU caches should be flushed, as needed, to maintain correct behavior. The only hazards FlushResource exposes are within the same device context.
// part of user mode Device interface:
STDMETHOD( FlushResource )( D3D10DDI_HDEVICE hDrvDevice,
    D3D11DDI_HRESOURCE hDrvResource ) = 0;
If a Subresource was created with flags preventing the CPU from mapping/ locking and writing to the Resource, the Subresource may still be able to be modified with UpdateSubresourceUP, as these concepts are mutually exclusive.
UpdateSubresourceUP may not be used when the Resource was created with flags allowing the CPU to map/ lock the Resource. It also may not be used with Resources that can be used as Depth/ Stencil, nor for multisampled Resources.
Partial updates of ConstantBuffers are disallowed, so when modifying ConstantBuffers with UpdateSubresourceUP, the update box will always be NULL.
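For example (a sketch at the API; pCtx and cbData are illustrative), a ConstantBuffer update always passes a NULL destination box and replaces the whole buffer:

// Pitches are ignored for Buffers; the NULL box means 'entire resource'.
pCtx->UpdateSubresource(pConstantBuffer, 0, NULL, &cbData, 0, 0);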
UpdateSubresource works with structured buffers as a destination. The source data is interpreted as an array of structures of the destination’s stride. If necessary, any conversion of the data to a different layout must happen during the update process. It is only valid to update ranges of complete structures. If the bounds of the region being updated are not a range of complete structures, the runtime will fail the update operation.
typedef struct D3D11DDIARG_UPDATESUBRESOURCEUPIN
{
    D3D11DDI_HRESOURCE hDstResource; // in: resource identifier
    UINT32 DstSubresource;           // in: zero based subresource index
    CONST D3D11_BOX* pDstBox;        // in: update box
    CONST VOID* pSrcUPData;          // in: data pointer
    SIZE_T SrcPitch;                 // in: data pitch
    SIZE_T SrcSlicePitch;            // in: data slice pitch
} D3D11DDIARG_UPDATESUBRESOURCEUPIN;

// part of user mode Device interface:
STDMETHOD( UpdateSubresourceUP )( D3D10DDI_HDEVICE hDrvDevice,
    CONST D3D11DDIARG_UPDATESUBRESOURCEUPIN* pUpdateSubresourceUPIn ) = 0;
This is a new variant of the UpdateSubresource() and CopySubresourceRegion() APIs (which both update a portion of a GPU surface) for D3D11.1. The addition is a Flags field where NO_OVERWRITE or DISCARD can be specified. A separate new feature that also affects UpdateSubresource is that it now allows overlapping copies.
void UpdateSubresource1(
    ID3D11Resource* pDstResource,
    UINT DstSubresource,
    const D3D11_BOX* pDstBox,
    const void* pSrcData,
    UINT SrcRowPitch,
    UINT SrcDepthPitch,
    UINT CopyFlags ); // new CopyFlags parameter where D3D11_COPY_NO_OVERWRITE,
                      // D3D11_COPY_DISCARD, or nothing can be specified.

void CopySubresourceRegion1(
    ID3D11Resource* pDstResource,
    UINT DstSubresource,
    UINT DstX,
    UINT DstY,
    UINT DstZ,
    ID3D11Resource* pSrcResource,
    UINT SrcSubresource,
    const D3D11_BOX* pSrcBox,
    UINT CopyFlags ); // new CopyFlags parameter where D3D11_COPY_NO_OVERWRITE,
                      // D3D11_COPY_DISCARD, or nothing can be specified.
Specifying NO_OVERWRITE means that the system can assume that existing references to the surface that may be in flight on the GPU will not be affected by the update, so the copy can proceed immediately (avoiding either a batch flush or the system maintaining multiple copies of the resource behind the scenes).
DISCARD means that the system may discard the entire contents of the destination memory outside the region being updated.
Before the first call with NO_OVERWRITE on a deferred context, a DISCARD must be done on the same context (via Copy*()/Update*()/Map() API flag or Discard*() API). This is not required on immediate contexts if the application knows the GPU is finished with the resource (though discard can be used if not).
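A sketch of the pattern on a deferred context (ID3D11DeviceContext1; the box and data names are illustrative):

// First touch of the resource in this command list discards the old contents...
pDeferredCtx1->UpdateSubresource1(pBuf, 0, &boxA, pDataA, 0, 0, D3D11_COPY_DISCARD);
// ...later updates promise not to disturb data the GPU may still be reading.
pDeferredCtx1->UpdateSubresource1(pBuf, 0, &boxB, pDataB, 0, 0, D3D11_COPY_NO_OVERWRITE);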
Tile based deferred rendering (TBDR) GPUs might particularly benefit from this. They are always running multiple passes over the same command buffer, so any resource that is updated in the middle of rendering has to be maintained in the driver in a before and after state, or the tiling pass has to end before the resource update is performed (which is a very expensive tile flush operation).
These APIs will drive not only the D3D11.1 DDI but also D3D9 DDIs. So new drivers for any DX9+ hardware would have to support/understand revised BLT, BUFBLT, VOLBLT and TEXBLT DDIs adding the flags discussed here.
These are also required to be supported for all D3D10+ hardware with D3D11.1 drivers.
The implementation of system to video blts is critical for good performance in Direct2D text rendering. Drivers that expose the cap bit indicating that they are a tile-based renderer will encounter the following situation during Direct2D text rendering:
When drivers encounter this scenario, they should implement the copy with the CPU synchronously. The NoOverWrite or Discard flag specified in the blt call can be used by the driver to map the destination surface for CPU access. These flags also enable drivers to implement this blt without a mid-scene flush. Drivers that implement this blt asynchronously (with either the CPU or the GPU) will see slowdowns when Direct2D attempts to map the system memory surface in the future.
Drivers on immediate-mode GPUs are free to implement system to video blts asynchronously.
DiscardResource() and DiscardView() API/DDIs (the latter allowing rects to be specified) allow applications to specify that the contents of a resource (or the subset of it that is in a View) may be discarded. This is reflected in both the D3D11.1 and D3D9 DDIs. The D3D9 DDI does not have Views, but does support limited subsetting of resources, so that is reflected in the new D3D9 Discard DDI (documented elsewhere).
On some GPUs with tile based deferred rendering (TBDR) architectures, binding RenderTargets that already have contents in them (from previous rendering) incurs a cost for having to copy the RenderTarget contents back into tile memory for rendering. If the application knows it is going to cover the entire surface anyway with new data, the copy is not needed.
On TBDRs a copy from tile memory back out can sometimes also be avoided. For example, if a Multisampled RTV is Resolve()'d and then Discard()ed, the implementation may be able to resolve as each tile is finished without having to write out the full multisampled tile data. Specifying Discard() right away, rather than waiting to specify discard on binding the resource later, requires less look-ahead for the driver to know what it can do.
Multi-GPU systems can also benefit from discard semantics, such as in cases where separate frames are rendered on different GPUs, avoiding the need for cross-GPU data copies.
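A sketch of the Resolve()-then-Discard() idiom mentioned above (ID3D11DeviceContext1; names are illustrative):

pCtx->ResolveSubresource(pResolvedTex, 0, pMsaaTex, 0, DXGI_FORMAT_R8G8B8A8_UNORM);
pCtx1->DiscardView(pMsaaRtv);   // multisampled contents no longer needed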
Section Contents
(back to chapter)
5.8.1 Intro
5.8.2 API Access
5.8.3 Mipmap Number Space
5.8.4 Fractional Clamping
5.8.5 Empty-Set Cases
5.8.6 Per-Resource Clamp Examples
D3D11 includes a way for applications to prevent some of the mipmaps in a resource from being accessible via the 3D pipeline (by clamping the mipmaps). This mechanism operates per-resource, as opposed to per-sampler(7.18.2) or per-ShaderResourceView, allowing applications a convenient way to globally control the GPU memory footprint that is referenced at any point. Drivers can easily take advantage of these per-resource clamps since they know that clamped off miplevels do not have to be resident in GPU memory.
Each resource (such as a texture2D) that an application creates will have a method on its interface that queues a D3D command setting a float32 scalar global MinLOD clamp for all Shader Resource Views of that resource. The fact that the command is queued means it does not affect the behavior of anything ahead of it in the queue.
Recall that lower LOD values define the more detailed mipmaps in a mipmap chain, so applying a MinLOD clamp has the effect of clamping off the most detailed miplevel(s).
The per-resource global MinLOD clamp applies to any reference to the resource from a shader via a Shader Resource View, such as using sample* or ld* instructions. Note that Sampler(7.18.2) objects already contain a fixed MinLOD and MaxLOD clamp, honored by instructions that take a Sampler as an operand such as sample*. The per-resource MinLOD clamp has the same effect as the Sampler MinLOD clamp (both clamps are applied), except each has a different number space for identifying mipmaps.
The per-resource MinLOD clamp considers the most detailed mipmap on the resource as LOD 0, so specifying a MinLOD clamp of 1 causes miplevel 0 on the resource to be ignored. On the other hand, the Sampler's MinLOD clamp defines the most detailed mipmap in the current Shader Resource View as LOD 0. So on a Shader Resource View that, for example, limits a mipmap chain to exclude the most detailed 3 mips from a resource, setting the Sampler MinLOD to 1 causes miplevel [3] (the fourth mip) in the resource to be ignored.
The per-resource MinLOD clamp can be fractional (like the Sampler(7.18.2) MinLOD clamp) – this is useful with linear mipmap filtering. For example, suppose the per-resource MinLOD clamp is 1.1, and the current Shader Resource View is the entire mipchain. Texture filters would behave as if the most detailed mipmap available is a blend of 90% of mipmap [1] and 10% of mipmap [2]. Both mipmap [1] and [2] would have to be resident on the GPU. A way to make use of the fractions is to start with a high MinLOD clamp (limiting the memory footprint enough to prevent stalling on texture upload to the GPU), and gradually lowering the MinLOD clamp on the resource over time, allowing the driver/hardware more time to make all of the resource resident. Visually there would be no popping, as the influence of more detailed mipmaps is blended in.
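In the shipped D3D11 API this clamp surfaced as ID3D11DeviceContext::SetResourceMinLOD rather than as a method on the resource interface itself; a sketch of the gradual-streaming idea under that API (assumes the texture was created with the RESOURCE_CLAMP misc flag; names are illustrative):

// Each frame, lower the clamp a little; filtering blends in the newly
// resident detail fractionally, so there is no visible pop.
pCtx->SetResourceMinLOD(pTex, currentClamp);
currentClamp = (currentClamp > 0.1f) ? currentClamp - 0.1f : 0.0f;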
A fractional per-resource MinLOD clamp basically requires the floor of the MinLOD miplevel and the less detailed miplevels to be resident. In the example above with a per-resource MinLOD clamp of 1.1, if a ld instruction requests data from miplevel [1], it will be resident.
As another example, consider the same Shader Resource View with a full mipchain, but a MinLOD clamp of 0.1. The gather4(22.4.2) instruction is defined to operate on mip 0 in the view only (otherwise an out of bounds result is returned). But since the clamp of 0.1 requires mip 0 to be present, gather4 will fetch from mip 0.
Suppose a ShaderResourceView on a resource is defined which limits the miplevels visible in the resource. Now suppose a per-resource MinLOD clamp is set such that the intersection of the remaining active miplevels after the clamp, with the miplevels used in a ShaderResourceView, is empty. e.g. using a ShaderResourceView of mipmaps 0..3 on a resource along with a resource MinLOD clamp of 5. The result of fetching from the ShaderResourceView with such an empty intersection with the per-resource clamp is the defined out-of-bounds access result. That is, 0 is returned for all non-missing components of the format of the resource, and the default is provided for missing components. The lod(22.5.6) instruction returns 0 for the clamped LOD in this empty-set case.
If a texture has 6 mip levels (0..5) and the MinLOD clamp is set to any value past the least detailed mip in the view (e.g. 5.1), the out of bounds behavior applies. This is an exception to the rule that the floor of the MinLOD clamp is required to be present.
Shader ld*(22.4.6) instructions, which do not perform filtering, and which access miplevels directly, also honor the per-resource MinLOD clamp. This is unlike the MinLOD clamp in Sampler state, since ld* instructions do not use samplers. The previous section has an example illustrating how ld behaves with a fractional clamp.
If sample*(22.4.15) instructions that explicitly provide a miplevel to fetch from, such as sample_l(22.4.18), request a miplevel that is clamped off by a per-resource MinLOD clamp (where the per-resource clamp still falls within the View), the result of the fetch is the same as what happens with sampler clamping; that is the most detailed available clamped mip (after both sampler and MinLOD clamp) is used.
When sampling with a Sampler(7.18.2) configured to use BorderColor, accessing the border region of a mipmap that has been clamped off by the MinLOD clamp produces the out-of-bounds behavior (as opposed to returning the border color).
Initial Conditions:
Resource: 8 miplevels [0..7]
Shader Resource View: [1..6] (so mip 0 in the view is mip 1 on the resource; in View space this is [0..5])
Sampler MinLOD = 1.2, MaxLOD = 4 (this is in the View mip number space)
Sampler filter mode: MIN_MAG_MIP_LINEAR
Per-Resource MinLOD clamp = 3.5 (this is in the Resource mip number space)
Some results:
Initial Conditions:
Resource: 8 miplevels [0..7]
Shader Resource View: [1..6] (so mip 0 in the view is mip 1 on the resource; in View space this is [0..5])
Sampler MinLOD = 1.2, MaxLOD = 4 (this is in the View mip number space)
Sampler filter mode: MIN_MAG_MIP_LINEAR
Per-Resource MinLOD clamp = 5.5 (this is in the Resource mip number space)
Some results:
Initial Conditions:
Resource: 8 miplevels [0..7]
Shader Resource View: [1..6] (so mip 0 in the view is mip 1 on the resource; in View space this is [0..5])
Sampler MinLOD = 1.2, MaxLOD = 4 (this is in the View mip number space)
Sampler filter mode: MIN_MAG_MIP_LINEAR
Per-Resource MinLOD clamp = 6.5 (this is in the Resource mip number space)
Some results:
Per-resource MinLOD clamps only affect the behavior of ShaderResourceView accesses from shader code – such as the sample* and ld* instructions discussed so far.
Other operations on the resource are unaffected by per-resource MinLOD clamps, including reading and/or writing via RenderTargetViews, DepthStencilViews, or resource manipulation APIs such as CopySubresourceRegion, UpdateSubresource or GenerateMips. Any such reference to the contents of a resource, i.e. NOT through a ShaderResourceView, requires the system to make appropriate memory resident for the requested operation to proceed as expected, unaffected by per-resource MinLOD clamping.
The behavior of the resinfo instruction with respect to the per-resource MinLOD clamp is defined within the instruction's definition(22.4.14).
Section Contents
(back to chapter)
5.9.1 Overview
This spec is for "Tiled Resources" in D3D. Other terms that have been used for the same concept are "Sparse Textures" and "Partially Resident Textures".
This document outlines what might be expected of D3D implementations if this hypothetical feature were included in a future version of D3D.
Recall that all D3D memory allocations are managed at subresource granularity (in a system without Tiled Resource support). For a Buffer, the entire Buffer is the subresource. For a Texture, each mip level is a subresource (at a given array slice if it is a Texture Array). The graphics system (OS, driver, hardware) only exposes the ability to manage the mapping of allocations at this subresource granularity. "Mapping", in the context of Tiled Resources in this spec, refers to making data visible to the GPU.
Suppose an application knows that a particular rendering operation only needs to access a small portion of an image mipmap chain (perhaps not even the full area of a given mipmap). Ideally the system could be told about this and only bother to ensure that the needed memory is mapped on the GPU without paging in too much. In reality, the system can only be informed about what memory needs to be mapped on the GPU at subresource granularity (i.e. a range of full mipmap levels that could be accessed). There is no demand faulting in the graphics system either, so potentially a lot of excess GPU memory must be used to make full subresources mapped before a rendering command that references any part of the memory is executed. This is just one issue that makes the use of large memory allocations difficult in D3D.
D3D11 supports Texture2D surfaces with up to 16384 pixels on a given side. An image that is 16384 wide by 16384 tall and 4 bytes per pixel would consume 1GB of video memory (and adding mipmaps would double that). In practice it is unlikely/rare that all 1GB would need to be referenced in a single rendering operation.
Some game developers are now modeling terrain surfaces as large as 128K by 128K. The way they get this to work on existing GPUs is to break the surface into tiles that are small enough for hardware to handle. The application must figure out which tiles might be needed and load them into a cache of textures on the GPU - a software paging system. A significant downside to this approach comes from the hardware not knowing anything about the paging that is going on: When a part of an image needs to be shown on screen that straddles tiles, the hardware does not know how to perform fixed function (i.e. efficient) filtering across tiles. This means the application managing its own software tiling must resort to manual texture filtering in shader code (which becomes very expensive if a good quality anisotropic filter is desired) and/or waste memory authoring gutters around tiles that contain data from neighboring tiles so that fixed function hardware filtering can continue to provide some assistance.
If a Tiled representation of surface allocations could be a first-class feature in the graphics system, the application could tell the hardware which tiles to make available. So (a) less GPU memory is wasted storing regions of surfaces that the application knows will not be accessed, and (b) the hardware can understand how to filter across adjacent tiles, alleviating some of the pain experienced by developers doing software tiling today.
But to provide a complete solution, something must be done to deal with the fact that, independent of whether tiling within a surface is supported, the maximum surface dimension is currently 16384 - nowhere near the 128K+ that applications already want. Just requiring the hardware to support larger texture sizes is one approach, however there are significant costs and/or tradeoffs to going this route. D3D11's texture filter path and rendering path are already saturated in terms of precision in supporting 16K textures with the other requirements, such as supporting viewport extents falling off the surface during rendering, or supporting texture wrapping off the surface edge during filtering. A possibility is to define a tradeoff such that as the texture size increases beyond 16K, functionality/precision is given up in some manner. Even with this concession however, additional hardware costs may be required in terms of addressing capability throughout the hardware system to go to larger texture sizes.
One issue that comes into play as textures get very large is that single precision floating point texture coordinates (and the associated interpolators to support rasterization) run out of precision to specify locations on the surface accurately. Jittery texture filtering would ensue. One expensive option would be to require double precision interpolator support, though that could be overkill given a reasonable alternative - discussed later.
Regardless of whether the supported texture size may be increased above 16K, if the limit arrived at is not orders of magnitude larger, the question would still remain: What if the application wants a surface even larger than whatever limit is in place? A reasonable approach could be to "Quilt" these large textures manually, independent of the Tiling within each texture. This document covers an approach along these lines. This might also mitigate a lack of double precision attribute interpolation.
The reason one of the alternate names for this feature is "Sparse Texture" is that "Sparse" conveys both the Tiled nature of the resources and perhaps the primary reason for Tiling them - that not all of them are expected to be mapped at once. In fact, it is conceivable that an application could intentionally author a Sparse/Tiled Resource in which data is not authored for all regions+mips of the resource. So the content itself could be sparse, and the mapping of the content in GPU memory at a given time would be a subset of that (even more sparse).
Another scenario that could be served by Tiled Resources is enabling multiple Resources of different dimensions/formats to share the same memory. Sometimes applications have exclusive sets of resources that are known not to be used at the same time, or resources that are created only for very brief use and then destroyed, followed by creation of other resources. A form of generality that can fall out of "Tiled Resources" is that it is possible to allow the user to point multiple different resources at the same (overlapping) memory. In other words, the creation and destruction of "resources" (which define a dimension/format etc.) can be decoupled from the management of the memory underlying the resources from the application's point of view.
The rest of this section dives into the details required to define "Tiled Resources" in the context of D3D.
To create a Tiled Resource, the flag D3D11_RESOURCE_MISC_TILED has to be specified as a MiscFlag on the Create* call. Restrictions on when this flag can be used are described later.
Whereas a non-Tiled Resource's storage is allocated in the system when the resource is created (e.g. CreateTexture2D API call), for a Tiled Resource, the storage for the Resource contents is not allocated. Instead, when a Tiled Resource is created at the API, the system makes an address space reservation for the tiled surface's area only, and then allows the mapping of the tiles to be controlled by the application. The "mapping" of a tile is simply the physical location in memory that a logical tile in a resource points to (or NULL for an unmapped tile). This is not to be confused with the notion of mapping a D3D resource for CPU access, which despite using the same name is completely independent. The developer will be able to define and change the mapping of each tile individually as needed, knowing that not all tiles for a surface need to be mapped at any one time, thereby making effective use of the amount of memory available.
When the flag D3D11_RESOURCE_MISC_TILED is specified on a resource, the tiles that make up the resource come from pointing at locations in a Tile Pool. A Tile Pool is a pool of memory (backed by one or more allocations behind the scenes - unseen by the application) that is simple for the operating system / driver to manage and whose memory footprint is easily understood by an application. Tiled Resources map 64KB regions by pointing to locations in a Tile Pool. One fallout of this setup is that it allows multiple Resources to share/reuse the same tiles, and also allows the same tiles to be reused at different locations within a Resource if desired.
The cost for the flexibility of populating the tiles for a Resource out of a Tile Pool is that the Resource has to do the work of defining and maintaining the mapping of which tiles in the Tile Pool represent the tiles needed for the Resource. Tile mappings can be changed. Also, not all tiles in a Resource need to be mapped at a time; it is a feature to be able to have NULL mappings - that is the definition of a tile not being available from the point of view of the Resource accessing it.
Multiple Tile Pools can be created, and any number of Tiled Resources can map into any given Tile Pool at the same time. Tile Pools can also be grown or shrunk (see Resizing Tile Pools(5.9.2.2.2) for details). One constraint, existing merely to simplify driver and runtime implementation, is that a given Tiled Resource may only have mappings into at most one Tile Pool at a time (as opposed to having simultaneous mappings to multiple Tile Pools).
The amount of storage associated with a Tiled Resource itself (independent of Tile Pool memory) should be roughly proportional to the number of tiles actually mapped to the pool at any given time. In hardware this boils down to scaling the memory footprint for page table storage roughly with the number of tiles that are mapped (e.g. using a multilevel page table scheme as appropriate).
The Tile Pool can be thought of as an entirely software abstraction that enables D3D applications to effectively be able to program the page tables on the GPU without having to know the low level implementation details (or deal with pointer addresses directly). Tile Pools do not apply any additional levels of indirection in hardware. Optimizations of a single level page table using constructs like page directories are independent of the Tile Pool concept.
Let us explore what storage the page table itself could require in the worst case (though in practice implementations should only require storage roughly proportional to what is mapped).
Suppose each page table entry is 64 bits.
For the worst-case page table size hit for a single surface, given the resource limits in D3D11, suppose a Tiled Resource is created with a 128 bit-per-element format (e.g. RGBA float), so a 64KB tile contains only 4096 pixels. The maximum supported Texture2DArray size of 16384*16384*2048 (but with only a single mipmap) would require about 1GB of storage in the page table if fully populated (not including mipmaps) using 64 bit table entries. Adding mipmaps would grow the fully-mapped (worst case) page table storage by about a third, to about 1.3GB.
This would give access to about 10.6 terabytes of addressable memory. There may well be a limit on the amount of addressable memory, however, which would reduce these amounts, perhaps to around the terabyte range.
Another case to consider is a single Texture2D Tiled Resource of 16384*16384 with a 32 bit-per-element format, including mipmaps. The space needed in a fully populated page table would be roughly 170KB with 64 bit table entries.
Finally, consider an example using a BC format, say BC7 with 128 bits per tile of 4x4 pixels. That is one byte per pixel. A Texture2DArray of 16384*16384*2048 including mipmaps would require roughly 85MB to fully populate this memory in a page table. That is not bad considering this allows one Tiled Resource to span 550 gigapixels (512 GB of memory in this case).
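The arithmetic in these examples can be sketched as follows (an illustrative helper, not part of any API; assumes dimensions are tile-aligned and approximates a full mip chain as 4/3 of the top level):

// Worst-case page-table bytes for a fully mapped Texture2D[Array],
// assuming 64-bit (8-byte) page table entries.
UINT64 WorstCaseTableBytes( UINT64 width, UINT64 height, UINT64 arraySize,
                            UINT64 pixelsPerTile, bool includeMips )
{
    UINT64 tiles = (width * height / pixelsPerTile) * arraySize;
    if (includeMips) tiles = tiles * 4 / 3;   // mip chain adds about a third
    return tiles * 8;
}
// WorstCaseTableBytes(16384, 16384, 2048,  4096, false) ~= 1 GB   (128bpp case)
// WorstCaseTableBytes(16384, 16384,    1, 16384, true)  ~= 170 KB (32bpp case)
// WorstCaseTableBytes(16384, 16384, 2048, 65536, true)  ~= 85 MB  (BC7 case)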
In practice nowhere near these full mappings would be defined given that the amount of physical memory available wouldn't allow anywhere near that much to be mapped and referenced at a time anyway. With a tile pool, however, applications could choose to reuse tiles (as a simple example, reusing a "black" colored tile for large black regions in an image) - effectively using the Tile Pool (i.e. page table mappings) as a tool for memory compression.
The initial contents of the page table are NULL for all entries. Applications also can't pass initial data for the memory contents of the surface since it starts off with no memory backing.
Applications can create one or more Tile Pools per D3D device. The total size of a given Tile Pool is restricted to D3D11's resource size limit, which is roughly 1/4 of GPU RAM.
A Tile Pool is made of 64KB tiles, but the operating system (driver) manages the entire pool as one or more allocations behind the scenes - the breakdown is not visible to applications. Tiled Resources define content by pointing at tiles within a Tile Pool. Unmapping a tile from a Tiled Resource is done simply by pointing it to NULL. Such unmapped tiles have rules about the behavior of reads or writes (defined later).
A Tile Pool is created via the CreateBuffer API using a flag to indicate it is a tile pool.
A ResizeTilePool()(5.9.3.4) API allows a Tile Pool to be grown if the application needs more working set for the Tiled Resource(s) mapping into it, or shrunk if less space is needed. Another option for applications is to allocate additional Tile Pools for new Tiled Resources; however, if any single Tiled Resource needs more space than initially available in its Tile Pool, growing the Tile Pool is a good option. A Tiled Resource can't have mappings into multiple Tile Pools at once.
When a Tile Pool is grown, additional Tiles are added to the end via one or more new allocations by the driver (breakdown into allocations not visible to the application). Existing memory in the Tile Pool is left untouched and existing Tiled Resource mappings into that memory remain intact.
When a Tile Pool is shrunk, tiles are removed from the end (this is allowed even below the initial allocation size, down to 0), meaning new mappings cannot be made past the new size. Existing mappings past the end of the new size, however, remain intact and usable, and drivers will keep the memory around as long as mappings to any part of the allocation(s) the driver uses for the Tile Pool memory remain. If, after shrinking, some memory has been kept alive because Tile Mappings still point to it, and the Tile Pool is then regrown (by any amount), the existing memory is reused first before any additional allocations occur to service the size of the grow operation.
To be able to save memory, an application has to not only shrink a Tile Pool but also remove/remap existing mappings past the end of the new smaller Tile Pool size.
The act of shrinking (and removing mappings) doesn't necessarily produce immediate memory savings. Freeing of memory depends on how granular the driver's underlying allocations for the Tile Pool are - when shrinking happens to be enough to make a driver allocation unused, the driver can free it. If a Tile Pool was grown, it is most likely that shrinking to previous sizes (and removing/remapping tile mappings correspondingly) will yield memory savings, though not guaranteed in the case that the sizes don't exactly align with the underlying allocation sizes chosen by the driver.
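A sketch of growing and shrinking a pool via ID3D11DeviceContext2::ResizeTilePool (sizes are illustrative and must be multiples of 64KB):

// Grow the pool to 256 tiles' worth of memory...
pCtx2->ResizeTilePool(pTilePool, 256 * 65536);
// ...later shrink it; mappings past the new size stay valid until remapped,
// so memory is only reclaimed once those mappings are removed/remapped.
pCtx2->ResizeTilePool(pTilePool, 64 * 65536);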
For non-Tiled Resources, D3D is able to prevent certain hazard conditions during rendering. For example, the D3D runtime does not allow any given SubResource to be bound as an input (such as a ShaderResourceView) and as an output (such as a RenderTargetView) at the same time. If such a case is encountered, the runtime unbinds the input. This tracking overhead in the runtime is cheap and is done at the SubResource level. One of the benefits of this is to minimize the chances of applications accidentally depending on hardware shader execution order - something that could vary if not on a given GPU, certainly would vary across different GPUs.
It may, however, be too expensive to do similar work on a per-tile level that may be necessary for Tiled Resources, since tracking would be at a tile level. New issues arise such as possibly validating away attempts to render to an RTV with one tile mapped to multiple areas in the surface simultaneously. If it turns out this per-tile hazard tracking is too expensive for the D3D runtime, ideally this would at least be an option in the Debug Layer.
Applications are required to inform the driver when they have issued a write or read to a tiled resource that references tile pool memory that will also be referenced by separate tiled resources in upcoming read or write operations, and where they expect the first operations to complete before the second can begin. See the TiledResourceBarrier()(5.9.3.5) command.
There are some constraints on the type of D3D resources allowed to be created with the D3D11_RESOURCE_MISC_TILED flag. The valid parameters are:
Supported Resource Type: Texture2D[Array] (incl. TextureCube[Array], which is a variant of Texture2D[Array]), Buffer (not Texture1D[Array] or Texture3D - Texture3D expected for future).
Supported Resource Usage: D3D11_USAGE_DEFAULT (not: _DYNAMIC, _STAGING or _IMMUTABLE).
Supported Resource Misc Flags: D3D11_RESOURCE_MISC_TILED (by definition), _MISC_TEXTURECUBE, _DRAWINDIRECT_ARGS, _BUFFER_ALLOW_RAW_VIEWS, _BUFFER_STRUCTURED, _RESOURCE_CLAMP, _GENERATE_MIPS (not: _SHARED, _SHARED_KEYEDMUTEX, _GDI_COMPATIBLE, _SHARED_NTHANDLE, _RESTRICTED_CONTENT, _RESTRICT_SHARED_RESOURCE, _RESTRICT_SHARED_RESOURCE_DRIVER, _GUARDED, _TILE_POOL)
Supported Bind Flags: D3D11_BIND_SHADER_RESOURCE, _RENDER_TARGET, _DEPTH_STENCIL, _UNORDERED_ACCESS (not _CONSTANT_BUFFER, _VERTEX_BUFFER [note that binding a tiled Buffer as an SRV/UAV/RTV is still ok], _INDEX_BUFFER, _STREAM_OUTPUT, _BIND_DECODER, _BIND_VIDEO_ENCODER)
Supported Formats: All formats that would be available for the given configuration regardless of it being tiled, with some exceptions detailed elsewhere.
Supported SampleDesc (Multisample count, quality): Whatever would be supported for the given configuration regardless of it being tiled, with some exceptions detailed elsewhere.
Supported Width/Height/MipLevels/ArraySize: Full extents supported by D3D11. Tiled Resources do not have the restriction on total memory size imposed on non-Tiled Resources - they are only constrained by overall Virtual Address Space limits(5.9.2.3.1).
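Putting the constraints above together, a minimal Tiled Resource creation sketch (names are illustrative; assumes a device that supports Tiled Resources):

D3D11_TEXTURE2D_DESC desc = {};
desc.Width            = 16384;
desc.Height           = 16384;
desc.MipLevels        = 1;
desc.ArraySize        = 1;
desc.Format           = DXGI_FORMAT_R8G8B8A8_UNORM;
desc.SampleDesc.Count = 1;
desc.Usage            = D3D11_USAGE_DEFAULT;
desc.BindFlags        = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_RENDER_TARGET;
desc.MiscFlags        = D3D11_RESOURCE_MISC_TILED;
ID3D11Texture2D* pTiledTex = nullptr;
HRESULT hr = pDevice->CreateTexture2D(&desc, nullptr, &pTiledTex); // no initial data allowed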
The initial contents of Tile Pool memory are undefined.
On 64 bit OSs, at least 40 bits of virtual address space (1 Terabyte) is available.
For 32 bit OSs, the address space is 32 bit. For 32 bit ARM systems, individual Tiled Resource creation can fail if the allocation would use more than 27 bits of address space (128 MB). This includes any hidden padding in the address space the hardware may use for mipmaps, packed tile padding, and possibly padding surface dimensions to powers of 2.
On systems with a separate page table for the GPU, most of this address space will be available to GPU resources made by the application, though GPU allocations made by the driver fit in the same space.
On future systems with a page table shared between the CPU and GPU, the available address space is shared between all CPU and GPU allocations in a process.
Tile Pools are defined by the following application specified properties (via the CreateBuffer API):
Size: Allocation size, as a multiple of 64KB (0 is valid since there is a Resize operation available).
Supported Resource Misc Flags: D3D11_RESOURCE_MISC_TILE_POOL (identifies it is a tile pool), D3D11_RESOURCE_MISC_SHARED, _SHARED_KEYEDMUTEX, _SHARED_NTHANDLE
Supported Resource Usage: D3D11_USAGE_DEFAULT only.
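A minimal Tile Pool creation sketch per the properties above (a 16-tile, 1MB pool; names are illustrative):

D3D11_BUFFER_DESC poolDesc = {};
poolDesc.ByteWidth = 16 * 65536;                       // multiple of 64KB
poolDesc.Usage     = D3D11_USAGE_DEFAULT;
poolDesc.MiscFlags = D3D11_RESOURCE_MISC_TILE_POOL;    // identifies it as a tile pool
ID3D11Buffer* pTilePool = nullptr;
HRESULT hr = pDevice->CreateBuffer(&poolDesc, nullptr, &pTilePool);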
Tile Pools can be shared with other processes just like traditional resources. Tiled Resources (which reference Tile Pools) cannot be shared across devices/processes. However separate processes can create their own Tiled Resources that map to Tile Pool(s) shared between them.
Shared Tile Pools cannot be resized.
Formats containing stencil are not supported with Tiled Resources.
This includes DXGI_FORMAT_D24_UNORM_S8_UINT (and related formats in the R24G8 family) and DXGI_FORMAT_D32_FLOAT_S8X24_UINT (and related formats in the R32G8X24 family).
Some implementations store depth and stencil in separate allocations while others store them together. The problem is that tile management for the two schemes would have to be different, and effort has not gone into coming up with a way to abstract or rationalize the differences in a single API. A recommendation for future hardware is to support independent depth and stencil surfaces, each independently tiled. 32 bit depth would have 128x128 tiles and 8 bit stencil would have 256x256 tiles, so applications would have to live with tile shape misalignment between depth and stencil, but the same problem exists with different RenderTarget surface formats already.
Tile controls are available on immediate or deferred contexts (just like updates to normal Resources) and upon execution impact subsequent accesses to the tiles (not previously submitted operations).
Data cannot be copied to/from Tile Pool memory directly. Accesses to the memory are always done through Tiled Resources.
When a Tiled Resource is created, the dimensions, format element size and number of mipmaps and/or array slices (if applicable) determine the number of tiles that would be required to back the entire surface area. The pixel/byte layout within tiles is implementation-chosen (until such time as a standard layout is defined for future hardware). The number of pixels that fit in a tile, depending on the format element size, is fixed and identical whether using a (future) standard swizzle or not.
This means that the number of tiles that will be used by a given surface size and format element width is well defined/predictable based on the following tables. For Resources that contain mipmaps, or cases where surface dimensions don't fill a tile, however, there are some constraints, discussed later(5.9.2.8.5).
Different Tiled Resources can point to the same memory with different formats as long as applications don't rely on the results of writing to the memory with one format and reading with another, unless the formats are in the same format family (have the same typeless parent format) - e.g. R8G8B8A8_UNORM and R8G8B8A8_UINT are compatible with each other but not with R16G16_UNORM. There is one exception where bleeding data from one format aliasing to another is well defined: a tile that contains 0 for all of its bits can be used with any format that interprets those memory contents as 0 (regardless of memory layout). So a tile could be cleared to 0x00 with the format R8_UNORM and then used with a format like R32G32_FLOAT, and it would appear the contents are still (0.0f,0.0f).
The layout of data within a tile does not depend on where the tile is mapped in a resource overall. So, for example, a tile can be reused in different locations of a surface at once with consistent behavior in all locations.
(not counting tail mip packing)
Texture1D[Array] Tiled Resource support was designed as follows but not exposed for lack of utility.
Bits/Pixel | Tile Dimensions (Pixels) |
8 | 65536 |
16 | 32768 |
32 | 16384 |
64 | 8192 |
128 | 4096 |
BC1,4 | Not supported |
BC3,5,7 | Not supported |
Other format bit counts not supported with Tiled Resources: 96bpp formats, video formats, R1_UNORM, R8G8_B8G8_UNORM, G8R8_G8B8_UNORM.
(not counting tail mip packing)
Bits/Pixel (1 sample/pixel) | Tile Dimensions (Pixels, WxH) |
8 | 256x256 |
16 | 256x128 |
32 | 128x128 |
64 | 128x64 |
128 | 64x64 |
BC1,4 | 512x256 |
BC2,3,5,6,7 | 256x256 |
Other format bit counts not supported with Tiled Resources: 96bpp formats, video formats, R1_UNORM, R8G8_B8G8_UNORM, G8R8_G8B8_UNORM.
Multisample Count | Divide Tile Dimensions Above by (WxH) |
1 | 1x1 |
2 | 2x1 |
4 | 2x2 |
8 | 4x2 |
16 | 4x4 |
Only sample counts 1 and 4 are required (and allowed) to be supported with Tiled Resources. 2, 8, and 16 are shown for future consideration.
Implementations may choose to support 2, 8, and/or 16 sample MSAA for NON-Tiled Resources even though Tiled Resources don't support them.
Tiled Resources with sample counts larger than 1 cannot use 128bpp formats.
The constraints on supported sample counts and formats are due to hardware inconsistencies from the desired spec at the time of design.
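To illustrate how the tables above translate into tile counts, here is an illustrative helper (not part of the API) that computes the tiles consumed by one 2D subresource; partial tiles at the right/bottom edges still cost a full tile:

    UINT TilesForSubresource2D(UINT WidthInTexels, UINT HeightInTexels,
                               UINT TileWidthInTexels, UINT TileHeightInTexels)
    {
        UINT TilesX = (WidthInTexels  + TileWidthInTexels  - 1) / TileWidthInTexels;  // round up
        UINT TilesY = (HeightInTexels + TileHeightInTexels - 1) / TileHeightInTexels; // round up
        return TilesX * TilesY;
    }

For example, a 1000x1000 32bpp surface uses 128x128 tiles, so it consumes ceil(1000/128)^2 = 8*8 = 64 tiles; at 4x MSAA the tile shape shrinks to 64x64 (per the divide table above), giving 16*16 = 256 tiles.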
(not counting tail mip packing)
This takes the Texture2D tiling, divides the x/y dimensions by 4 each, and adds 16 layers of depth. All the tiles for the first plane (the 2D plane of tiles defining the first 16 layers of depth) appear before subsequent planes:
Texture3D support in Tiled Resources is not exposed in the initial implementation of Tiled Resource, but the desired tile shapes are listed here for consideration in a future release.
Bits/Pixel (1 sample/pixel) | Tile Dimensions (Pixels, WxHxD) |
8 | 64x32x32 |
16 | 32x32x32 |
32 | 32x32x16 |
64 | 32x16x16 |
128 | 16x16x16 |
BC1,4 | 128x64x16 |
BC2,3,5,6,7 | 64x64x16 |
Other format bit counts not supported with Tiled Resources: 96bpp formats, video formats, R1_UNORM, R8G8_B8G8_UNORM, G8R8_G8B8_UNORM.
A Buffer Resource is trivially divided into 64KB tiles, with some empty space in the last tile if the size is not a multiple of 64KB.
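As a quick illustration of the arithmetic (the helper name is ours, not part of the API):

    UINT64 TilesForBuffer(UINT64 SizeInBytes)
    {
        // Round up: a partial final tile still consumes a whole 64KB tile.
        return (SizeInBytes + 65535) / 65536;
    }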
Structured Buffers have no constraint on the Stride in order to be Tiled; however, possible hardware performance optimizations for Structured Buffers may be sacrificed by making them Tiled in the first place.
Depending on the Tier(5.9.7) of Tiled Resources support, mipmaps with certain dimensions do not follow the standard tile shapes and are considered to all be packed together with one another in a manner that is opaque to the application. Higher Tiers of support have broader guarantees about what types of surface dimensions fit in the standard tile shapes (and can therefore be individually mapped by applications).
What can vary between implementations is that - given a Tiled Resource's dimensions, format, number of mipmaps and array slices - some number M of mips (per array slice) may be packed into some number N of tiles. The GetResourceTiling()(5.9.3.2) API exists to allow the driver to report to the application what M and N are (the other details about the surface that this API reports are standard and do not vary by IHV). The tiles for the packed mips are still 64KB and can be individually mapped into disparate locations in a Tile Pool; however, the pixel shape of the tiles and how the mipmaps fit across the set of tiles is IHV specific and too complex to expose. So applications are required to either map all of the tiles that are designated as packed, or none of them, at a time. Otherwise the behavior for accessing the Tiled Resource is undefined.
For arrayed surfaces, the set of packed mips and the number of packed tiles storing those mips (M and N described above) applies individually for each array slice.
Dedicated APIs for CopyingTiles(5.9.3.3) cannot access packed mips. Applications that wish to copy data to/from packed mips can do so using all the non-Tiled Resource specific APIs for copying and rendering to surfaces.
For the purposes of populating the contents of mipmapped Tiled Resources for mips that are non packed (use the standard tile shapes) from CPU memory (e.g. Staging memory or user data pointers), there is a well defined CPU-side layout for the tiling of all mipmaps independent of implementation (described in the Copying Tiles(5.9.3.3) section). Implementations can hide any differences in tile breakdown of mipmaps on the GPU side during Copy operations.
The following APIs allow manipulation and querying of tile mappings. Update calls only affect the tiles identified in the call, and others are left as defined previously.
Any given tile from a Tile Pool can be mapped to multiple locations in a Resource and even multiple Resources. This includes tiles in a Resource that have an implementation chosen layout, described earlier, where multiple mipmaps are packed together into a single tile. The catch is that if data is written to the tile via one mapping, but read via a differently configured mapping, the results are undefined. Careful use of this flexibility can still be useful for an application though, like sharing a tile between resources that will not be used simultaneously, where the contents of the tile are always initialized through the same Resource mapping as they will be subsequently read from. Similarly a tile mapped to hold the packed mipmaps of multiple different Resources with the same surface dimensions will work fine - the data will appear the same in both mappings.
Changes to tile assignments for a Resource can be made at any time in an immediate or deferred context.
// --------------------------------------------------------------------------------------------------------------------------------
// Data Structures for Manipulating Tile Mappings
// --------------------------------------------------------------------------------------------------------------------------------
// For manipulating tile mappings, regions in tiled resources are described by a combination of:
// (1) tiled resource coordinate (defining the corner of a region) and
// (2) tile region size (defining the size of a region)
//
// These are separated into two structs rather than one so that the various APIs
// that use them can use different combinations of the parts.

typedef struct D3D11_TILED_RESOURCE_COORDINATE
{
    // Coordinate values below index tiles (not pixels or bytes).
    UINT X;           // Used for buffer, 1D, 2D, 3D
    UINT Y;           // Used for 2D, 3D
    UINT Z;           // Used for 3D
    UINT Subresource; // Indexes into mips, arrays. Used for 1D, 2D, 3D.
                      // For mipmaps that use nonstandard tiling and/or are packed, any subresource
                      // value that indicates any of the packed mips refers to the same set of tiles.
} D3D11_TILED_RESOURCE_COORDINATE;

typedef struct D3D11_TILE_REGION_SIZE
{
    UINT NumTiles;
    BOOL bUseBox;  // TRUE: Uses the Width/Height/Depth parameters below to define the region.
                   //   Width*Height*Depth must match NumTiles above. (While this looks like
                   //   redundant information, the application likely has to know how many tiles
                   //   are involved anyway.)
                   //   The downside to using the box parameters is that one update region cannot
                   //   span mipmaps (though it can span array slices via the Depth parameter).
                   //
                   // FALSE: Ignores the Width/Height/Depth parameters - NumTiles just traverses tiles
                   //   in the resource linearly across x, then y, then z (as applicable), then spilling
                   //   over mips/arrays in subresource order. Useful for just mapping an entire
                   //   resource at once, for example.
                   //
                   // In either case, the starting location for the region within the resource
                   // is specified as a separate parameter outside this struct, using x,y,z coordinates
                   // regardless of whether bUseBox above is TRUE or FALSE.
                   //
                   // When the region includes mipmaps that are packed with nonstandard tiling,
                   // bUseBox must be FALSE, since tile dimensions are not standard and the application
                   // only knows a count of how many tiles are consumed by the packed area (which is per
                   // array slice). The corresponding (separate) starting location parameter uses x to
                   // offset into the flat range of tiles in this case, and the y,z coordinates must be 0.
    UINT   Width;  // In tiles, used for buffer, 1D, 2D, 3D
    UINT16 Height; // In tiles, used for 2D, 3D
    UINT16 Depth;  // In tiles, used for 3D or arrays. For arrays, advancing in depth jumps to the next
                   // slice of the same mip size, which is not contiguous in the subresource counting
                   // space if there are multiple mips.
} D3D11_TILE_REGION_SIZE;

typedef enum D3D11_TILE_MAPPING_FLAG
{
    D3D11_TILE_MAPPING_NO_OVERWRITE = 0x00000001,
} D3D11_TILE_MAPPING_FLAG;

typedef enum D3D11_TILE_RANGE_FLAG
{
    D3D11_TILE_RANGE_NULL              = 0x00000001,
    D3D11_TILE_RANGE_SKIP              = 0x00000002,
    D3D11_TILE_RANGE_REUSE_SINGLE_TILE = 0x00000004,
} D3D11_TILE_RANGE_FLAG;

// --------------------------------------------------------------------------------------------------------------------------------
// UpdateTileMappings
// --------------------------------------------------------------------------------------------------------------------------------
// UpdateTileMappings adds/removes/changes mappings of tile locations in Tiled Resources to memory locations in a Tile Pool.
// The API has several modes of operation to enable a few common tasks to be efficiently described.
//
// The basic organization of the parameters is as follows:
//
// (1) Tiled Resource whose mappings are being updated.
// (2) Set of Tile Regions on the Tiled Resource whose mappings to update.
// (3) Tile Pool providing memory where tile mappings can go.
// (4) Set of Tile Ranges where mappings are going: to the Tile Pool in (3), to NULL, and/or other options.
// (5) Flags parameter for overall options.
//
// More detailed breakdown of the parameters:
//
// (1) Tiled Resource whose mappings are being updated - a resource created with the D3D11_RESOURCE_MISC_TILED flag.
//     Mappings start off all NULL when a resource is initially created.
//
// (2) Set of Tile Regions on the Tiled Resource whose mappings to update. One API call can update many mappings,
//     but an application can make multiple calls as well if that is more convenient (with a bit more API call overhead).
//     NumTiledResourceRegions specifies how many regions there are; pTiledResourceRegionStartCoordinates and
//     pTiledResourceRegionSizes are each arrays identifying the start location and extent of each region.
//     If NumTiledResourceRegions is 1, then for convenience either or both of the arrays describing the regions can
//     be NULL. NULL for pTiledResourceRegionStartCoordinates means the start coordinate is all 0's, and NULL for
//     pTiledResourceRegionSizes identifies a default region that is the full set of tiles for the entire Tiled Resource,
//     including all mipmaps and/or array slices.
//
//     If pTiledResourceRegionStartCoordinates is not NULL and pTiledResourceRegionSizes is NULL, then the region
//     size defaults to 1 tile for all regions. This makes it easy to define mappings for a set of individual tiles,
//     each at disparate locations, by providing an array of locations in pTiledResourceRegionStartCoordinates without
//     having to send an array of pTiledResourceRegionSizes all set to 1.
//
//     The updates are applied from first region to last, so if regions overlap in a single call, the updates
//     later in the list overwrite the areas overlapping with previous updates.
//
// (3) Tile Pool providing memory where mappings are pointing to. A Tiled Resource can point to a single Tile Pool
//     at a time. If a new Tile Pool is specified (for the first time, or different from the last time a Tile Pool
//     was specified), all existing tile mappings for the Tiled Resource are cleared and the new set of mappings in
//     the current call are applied for the new Tile Pool.
//     If no Tile Pool is specified (NULL), or the same one as a previous call to UpdateTileMappings is provided,
//     the call just adds the new mappings to existing ones (overwriting on overlap).
//     If the call is only defining NULL mappings, no Tile Pool needs to be specified, since it doesn't matter.
//     But if one is specified anyway it takes the same behavior as described above when providing a Tile Pool.
//
// (4) Set of Tile Ranges where mappings are going to. Each given Tile Range can specify one of a few types of
//     ranges: a range of tiles in a Tile Pool (default), a count of tiles in the Tiled Resource to map to
//     a single tile in a Tile Pool (sharing the tile), a count of tile mappings in the Tiled Resource to skip
//     and leave as they are, or a count of tiles in the Tiled Resource to map to NULL.
//
//     NumRanges specifies the number of Tile Ranges, where the total number of tiles identified across all ranges
//     must match the total number of tiles in the Tile Regions from the Tiled Resource described above.
//     Mappings are defined by iterating through the tiles in the Tile Regions in sequential order - x then y
//     then z order for box regions - while walking through the set of Tile Ranges in sequential order.
//     The breakdown of Tile Regions doesn't have to line up with the breakdown of Tile Ranges -
//     all that matters is that the total number of tiles on both sides is equal, so that each Tiled Resource tile
//     specified has a mapping specified.
//
//     pRangeFlags, pTilePoolStartOffsets and pRangeTileCounts are all arrays, of size NumRanges, describing the Tile
//     Ranges. If pRangeFlags is NULL, all ranges are sequential tiles in the Tile Pool; otherwise, for each range i,
//     pRangeFlags[i] identifies how the mappings in that range of tiles work:
//
//     If pRangeFlags[i] is 0, that range defines sequential tiles in the Tile Pool, with the number of tiles being
//     pRangeTileCounts[i] and the starting location pTilePoolStartOffsets[i]. If NumRanges is 1, pRangeTileCounts
//     can be NULL and defaults to the total number of tiles specified by all the Tile Regions.
//
//     If pRangeFlags[i] is D3D11_TILE_RANGE_REUSE_SINGLE_TILE, pTilePoolStartOffsets[i] identifies the single
//     tile in the Tile Pool to map to, and pRangeTileCounts[i] specifies how many tiles from the Tile Regions to
//     map to that Tile Pool location. If NumRanges is 1, pRangeTileCounts can be NULL and defaults to the total
//     number of tiles specified by all the Tile Regions.
//
//     If pRangeFlags[i] is D3D11_TILE_RANGE_NULL, pRangeTileCounts[i] specifies how many tiles from the Tile Regions
//     to map to NULL. If NumRanges is 1, pRangeTileCounts can be NULL and defaults to the total
//     number of tiles specified by all the Tile Regions. pTilePoolStartOffsets[i] is ignored for NULL mappings.
//
//     If pRangeFlags[i] is D3D11_TILE_RANGE_SKIP, pRangeTileCounts[i] specifies how many tiles from the Tile Regions
//     to skip over and leave existing mappings unchanged for. This can be useful if a Tile Region conveniently
//     bounds an area of Tile Mappings to update, except with some exceptions that need to be left the same as
//     whatever they were mapped to before. pTilePoolStartOffsets[i] is ignored for SKIP mappings.
//
// (5) Flags: D3D11_TILE_MAPPING_NO_OVERWRITE means the caller promises that previously submitted commands to the
//     device that may still be executing do not reference any of the tile region being updated.
//     This allows the device to avoid having to flush previously submitted work in order to do the tile mapping
//     update. If the application violates this promise by updating tile mappings for locations in Tiled Resources
//     still being referenced by outstanding commands, undefined rendering behavior results, including the potential
//     for significant slowdowns on some architectures. This is like the "no overwrite" concept that exists
//     elsewhere in the API, except applied to the Tile Mapping data structure itself (which in hardware is a page table).
//     The absence of this flag requires that tile mapping updates specified by this call must be completed before any
//     subsequent D3D command can proceed.
//
// Return values:
//
// Returns S_OK, E_INVALIDARG, E_OUTOFMEMORY or DXGI_ERROR_DEVICE_REMOVED. E_OUTOFMEMORY can happen if the call results
// in the driver having to allocate space for new page table mappings but running out of memory.
//
// If out of memory occurs when this is called in a CommandList and the CommandList is being executed, the device will be removed.
// Applications can avoid this situation by only doing update calls that change existing mappings of Tiled Resources
// within CommandLists (so drivers will not have to allocate page table memory, only change the mapping).
//
// Validation remarks:
//
// The tile regions specified must entirely fit in the tiled resource or behavior is undefined (the debug layer will emit an error).
// The number of tiles in the tile regions must match the number of tiles in all the tile ranges, otherwise the
// call is dropped with E_INVALIDARG. Other parameter errors also result in the call being dropped with E_INVALIDARG - the
// debug layer provides explanations.
//
HRESULT ID3D11DeviceContext2::UpdateTileMappings(
    _In_ ID3D11Resource* pTiledResource,
    _In_ UINT NumTiledResourceRegions,
    _In_reads_opt_(NumTiledResourceRegions) const D3D11_TILED_RESOURCE_COORDINATE* pTiledResourceRegionStartCoordinates,
    _In_reads_opt_(NumTiledResourceRegions) const D3D11_TILE_REGION_SIZE* pTiledResourceRegionSizes,
    _In_opt_ ID3D11Buffer* pTilePool,
    _In_ UINT NumRanges,
    _In_reads_opt_(NumRanges) const UINT* pRangeFlags,
    _In_reads_opt_(NumRanges) const UINT* pTilePoolStartOffsets, // 0-based tile offsets, counting in tiles (not bytes)
    _In_reads_opt_(NumRanges) const UINT* pRangeTileCounts,
    _In_ UINT Flags );

// ----------------------------------------------------------
// Here are some examples of common UpdateTileMappings cases:
// ----------------------------------------------------------
//
// ----------------------------------------------
// Clearing an entire surface's mappings to NULL:
// ----------------------------------------------
// - No-overwrite is specified, assuming it is known nothing else the GPU could be doing is referencing the previous mappings
// - NULL for pTiledResourceRegionStartCoordinates and pTiledResourceRegionSizes defaults to the entire resource
// - NULL for pTilePoolStartOffsets since it isn't needed for mapping tiles to NULL
// - NULL for pRangeTileCounts when NumRanges is 1 defaults to the same number of tiles as the tiled resource region (which is
//   the entire surface in this case)
//
// UINT RangeFlags = D3D11_TILE_RANGE_NULL;
// pDeviceContext2->UpdateTileMappings(pTiledResource,1,NULL,NULL,NULL,1,&RangeFlags,NULL,NULL,D3D11_TILE_MAPPING_NO_OVERWRITE);
//
// -------------------------------------------
// Mapping a region of tiles to a single tile:
// -------------------------------------------
// - This maps a 2x3 tile region at tile offset (1,1) in a Tiled Resource to tile [12] in a Tile Pool
//
// D3D11_TILED_RESOURCE_COORDINATE TRC;
// TRC.X = 1;
// TRC.Y = 1;
// TRC.Z = 0;
// TRC.Subresource = 0;
//
// D3D11_TILE_REGION_SIZE TRS;
// TRS.bUseBox = TRUE;
// TRS.Width = 2;
// TRS.Height = 3;
// TRS.Depth = 1;
// TRS.NumTiles = TRS.Width * TRS.Height * TRS.Depth;
//
// UINT RangeFlags = D3D11_TILE_RANGE_REUSE_SINGLE_TILE;
// UINT StartOffset = 12;
// pDeviceContext2->UpdateTileMappings(pTiledResource,1,&TRC,&TRS,pTilePool,1,&RangeFlags,&StartOffset,
//                                     NULL,D3D11_TILE_MAPPING_NO_OVERWRITE);
//
// ----------------------------------------------------------
// Defining mappings for a set of disjoint individual tiles:
// ----------------------------------------------------------
// - This can also be accomplished in multiple calls. Using a single call to define multiple mapping updates
//   can reduce CPU call overhead slightly, at the cost of having to pass arrays as parameters.
// - Passing NULL for pTiledResourceRegionSizes defaults to each region in the Tiled Resource
//   being a single tile. So all that is needed are the coordinates of each one.
// - Passing NULL for pRangeFlags defaults to no flags (since none are needed in this case)
// - Passing NULL for pRangeTileCounts defaults to each range in the Tile Pool being size 1.
//   So all that is needed are the start offsets for each tile in the Tile Pool
//
// D3D11_TILED_RESOURCE_COORDINATE TRC[3];
// UINT StartOffsets[3];
// UINT NumSingleTiles = 3;
//
// TRC[0].X = 1;
// TRC[0].Y = 1;
// TRC[0].Subresource = 0;
// StartOffsets[0] = 1;
//
// TRC[1].X = 4;
// TRC[1].Y = 7;
// TRC[1].Subresource = 0;
// StartOffsets[1] = 4;
//
// TRC[2].X = 2;
// TRC[2].Y = 3;
// TRC[2].Subresource = 0;
// StartOffsets[2] = 7;
//
// pDeviceContext2->UpdateTileMappings(pTiledResource,NumSingleTiles,TRC,NULL,pTilePool,NumSingleTiles,NULL,StartOffsets,NULL,D3D11_TILE_MAPPING_NO_OVERWRITE);
//
// -----------------------------------------------------------------------------------
// Complex example - defining mappings for regions with some skips, some NULL mappings
// -----------------------------------------------------------------------------------
// - This complex example hard codes the parameter arrays, whereas in practice the
//   application would likely configure the parameters programmatically or in a data-driven way.
// - Suppose we have 3 regions in a Tiled Resource to configure mappings for: 2x3 at coordinate (1,1),
//   3x3 at coordinate (4,7), and 7x1 at coordinate (20,30)
// - The tiles in the regions are walked from first to last, in X then Y then Z order,
//   while stepping forward through the specified Tile Ranges to determine each mapping.
//   In this example, 22 tile mappings need to be defined.
// - Suppose we want the first 3 tiles to be mapped to a contiguous range in the Tile Pool starting at
//   tile pool location [9], the next 8 to be skipped (left unchanged), the next 2 to map to NULL,
//   the next 5 to share a single tile (tile pool location [17]) and the remaining
//   4 tiles to each map to unique tile pool locations, [2], [9], [4] and [17]:
//
// D3D11_TILED_RESOURCE_COORDINATE TRC[3];
// D3D11_TILE_REGION_SIZE TRS[3];
// UINT NumRegions = 3;
//
// TRC[0].X = 1;
// TRC[0].Y = 1;
// TRC[0].Subresource = 0;
// TRS[0].bUseBox = TRUE;
// TRS[0].Width = 2;
// TRS[0].Height = 3;
// TRS[0].NumTiles = TRS[0].Width * TRS[0].Height;
//
// TRC[1].X = 4;
// TRC[1].Y = 7;
// TRC[1].Subresource = 0;
// TRS[1].bUseBox = TRUE;
// TRS[1].Width = 3;
// TRS[1].Height = 3;
// TRS[1].NumTiles = TRS[1].Width * TRS[1].Height;
//
// TRC[2].X = 20;
// TRC[2].Y = 30;
// TRC[2].Subresource = 0;
// TRS[2].bUseBox = TRUE;
// TRS[2].Width = 7;
// TRS[2].Height = 1;
// TRS[2].NumTiles = TRS[2].Width * TRS[2].Height;
//
// UINT NumRanges = 8;
// UINT RangeFlags[8];
// UINT TilePoolStartOffsets[8];
// UINT RangeTileCounts[8];
//
// RangeFlags[0] = 0;
// TilePoolStartOffsets[0] = 9;
// RangeTileCounts[0] = 3;
//
// RangeFlags[1] = D3D11_TILE_RANGE_SKIP;
// TilePoolStartOffsets[1] = 0; // offset is ignored for SKIP mappings
// RangeTileCounts[1] = 8;
//
// RangeFlags[2] = D3D11_TILE_RANGE_NULL;
// TilePoolStartOffsets[2] = 0; // offset is ignored for NULL mappings
// RangeTileCounts[2] = 2;
//
// RangeFlags[3] = D3D11_TILE_RANGE_REUSE_SINGLE_TILE;
// TilePoolStartOffsets[3] = 17;
// RangeTileCounts[3] = 5;
//
// RangeFlags[4] = 0;
// TilePoolStartOffsets[4] = 2;
// RangeTileCounts[4] = 1;
//
// RangeFlags[5] = 0;
// TilePoolStartOffsets[5] = 9;
// RangeTileCounts[5] = 1;
//
// RangeFlags[6] = 0;
// TilePoolStartOffsets[6] = 4;
// RangeTileCounts[6] = 1;
//
// RangeFlags[7] = 0;
// TilePoolStartOffsets[7] = 17;
// RangeTileCounts[7] = 1;
//
// pDeviceContext2->UpdateTileMappings(pTiledResource,NumRegions,TRC,TRS,pTilePool,NumRanges,RangeFlags,
//                                     TilePoolStartOffsets,RangeTileCounts,D3D11_TILE_MAPPING_NO_OVERWRITE);
//
// --------------------------------------------------------------------------------------------------------------------------------
// CopyTileMappings
// --------------------------------------------------------------------------------------------------------------------------------
// CopyTileMappings helps with tasks such as shifting mappings around within/across Tiled Resources, e.g. scrolling tiles.
// The source and dest regions can overlap - the result of the copy in this case is as if the source was saved to a temp and then
// from there written to the dest, though the implementation may be able to do better.
//
// If the dest resource has a different tile pool than the source, any existing mappings in the dest are cleared to NULL
// and the mappings from the source are applied. This maintains the rule that a given resource can have mappings into
// only one tile pool at a time.
//
// The Flags field allows D3D11_TILE_MAPPING_NO_OVERWRITE to be specified, which means the caller promises that previously
// submitted commands to the device that may still be executing do not reference any of the tile region being updated.
// This allows the device to avoid having to flush previously submitted work in order to do the tile mapping
// update. If the application violates this promise by updating tile mappings for locations in Tiled Resources
// still being referenced by outstanding commands, undefined rendering behavior results, including the potential
// for significant slowdowns on some architectures. This is like the "no overwrite" concept that exists
// elsewhere in the API, except applied to the Tile Mapping data structure itself (which in hardware is a page table).
// The absence of this flag requires that tile mapping updates specified by this call must be completed before any
// subsequent D3D command can proceed.
//
// Return Values:
//
// Returns S_OK or E_INVALIDARG or E_OUTOFMEMORY. The latter can happen if the call results in the driver having to
// allocate space for new page table mappings but running out of memory.
//
// If out of memory occurs when this is called in a CommandList and the CommandList is being executed, the device will be removed.
// Applications can avoid this situation by only doing update calls that change existing mappings of Tiled Resources
// within CommandLists (so drivers will not have to allocate page table memory, only change the mapping).
//
// Various other basic conditions, such as invalid flags or passing in non-Tiled Resources, result in the call being dropped
// with E_INVALIDARG.
//
// Validation remarks:
//
// The dest and the source regions must each entirely fit in their resource or behavior is undefined
// (the debug layer will emit an error).
//
HRESULT ID3D11DeviceContext2::CopyTileMappings(
    _In_ ID3D11Resource* pDestTiledResource,
    _In_ const D3D11_TILED_RESOURCE_COORDINATE* pDestRegionStartCoordinate,
    _In_ ID3D11Resource* pSourceTiledResource,
    _In_ const D3D11_TILED_RESOURCE_COORDINATE* pSourceRegionStartCoordinate,
    _In_ const D3D11_TILE_REGION_SIZE* pTileRegionSize,
    _In_ UINT Flags // The only flag that can be specified is:
                    // D3D11_TILE_MAPPING_NO_OVERWRITE (see definition under UpdateTileMappings)
);

APIs for retrieving tile mappings from the device are not included (contrary to general D3D convention) because of the high cost and complexity of implementing them in a performant way for what appears to be little value. Applications will have to track this state on their own. Tools scenarios are expected to simply track API state from the time the device was created.
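As a hedged usage sketch (resource names assumed), scrolling the mappings of a 16x8 tile region one tile to the left within the same resource; overlap between source and dest is allowed, as described above:

    D3D11_TILED_RESOURCE_COORDINATE Src = {};
    Src.X = 1; // region starts one tile to the right...
    D3D11_TILED_RESOURCE_COORDINATE Dst = {};
    Dst.X = 0; // ...and lands at the left edge

    D3D11_TILE_REGION_SIZE Size = {};
    Size.bUseBox = TRUE;
    Size.Width = 16;
    Size.Height = 8;
    Size.Depth = 1;
    Size.NumTiles = Size.Width * Size.Height * Size.Depth;

    pDeviceContext2->CopyTileMappings(pTiledResource, &Dst, pTiledResource, &Src, &Size, 0);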
// --------------------------------------------------------------------------------------------------------------------------------
// GetResourceTiling
// --------------------------------------------------------------------------------------------------------------------------------
// GetResourceTiling retrieves information about how a Tiled Resource is broken into tiles.
//
typedef struct D3D11_SUBRESOURCE_TILING
{
    // Each packed mip is individually reported as 0 for WidthInTiles, HeightInTiles and DepthInTiles.
    UINT WidthInTiles;
    UINT HeightInTiles;
    UINT DepthInTiles;
    // The total number of tiles in a subresource is WidthInTiles*HeightInTiles*DepthInTiles.
    UINT StartTileIndexInOverallResource;
} D3D11_SUBRESOURCE_TILING;

// D3D11_PACKED_TILE is filled into D3D11_SUBRESOURCE_TILING.StartTileIndexInOverallResource
// for packed mip levels, signifying that the entire struct is meaningless (WidthInTiles, HeightInTiles
// and DepthInTiles are also all set to 0).
// For packed tiles, the description of the packed mips comes from D3D11_PACKED_MIP_DESC instead.
const UINT D3D11_PACKED_TILE = 0xffffffff;

typedef struct D3D11_TILE_SHAPE
{
    UINT WidthInTexels;
    UINT HeightInTexels;
    UINT DepthInTexels;
    // Texels are equivalent to pixels. For untyped Buffer resources, a texel is just a byte.
    // For MSAA surfaces the numbers are still in terms of pixels/texels.
    // The values here are independent of the surface dimensions. Even if the surface is
    // smaller than what would fit in a tile, the full tile dimensions are reported here.
} D3D11_TILE_SHAPE;

typedef struct D3D11_PACKED_MIP_DESC
{
    UINT NumPackedMips; // How many mips, starting from the least detailed mip, are packed (either
                        // sharing tiles or using a nonstandard tile layout). 0 if there is no
                        // such packing in the resource. For array surfaces this value is how many
                        // mips are packed for a given array slice - each array slice repeats the same
                        // packing.
                        // Mipmaps that fill at least one standard shaped tile in all dimensions
                        // are not allowed to be included in the set of packed mips. Mips with at least one
                        // dimension less than the standard tile shape may or may not be packed,
                        // depending on the IHV. Once a given mip needs to be packed, all coarser
                        // mips for a given array slice are considered packed as well.
    UINT NumTilesForPackedMips; // If there is no packing, this value is meaningless and returns 0.
                        // Otherwise it returns how many tiles are needed to represent the set of
                        // packed mips.
                        // The pixel layout within the packed mips is hardware specific.
                        // If applications define only partial mappings for the set
                        // of tiles in packed mip(s), read/write behavior will be
                        // IHV specific and undefined.
                        // For arrays this only returns the count of packed mips within
                        // the subresources for each array slice.
    UINT StartTileIndexInOverallResource; // Offset of the first packed tile for the resource
                        // in the overall range of tiles. If NumPackedMips is 0, this
                        // value is meaningless and returns 0. Otherwise it returns the
                        // offset of the first packed tile for the resource in the overall
                        // range of tiles for the resource. A return of 0 for
                        // StartTileIndexInOverallResource means the entire resource is packed.
                        // For array surfaces this is the offset for the tiles containing the packed
                        // mips for the first array slice.
                        // Packed mips for each array slice in arrayed surfaces are at this offset
                        // past the beginning of the tiles for each array slice.
                        // (Note the number of overall tiles, packed or not, for a given array slice is
                        // simply the total number of tiles for the resource divided by the
                        // resource's array size, so it is easy to locate the range of tiles for
                        // any given array slice, out of which StartTileIndexInOverallResource identifies
                        // which of those are packed.)
} D3D11_PACKED_MIP_DESC;

void ID3D11Device2::GetResourceTiling(
    _In_ ID3D11Resource* pTiledResource,
    _Out_opt_ UINT* pNumTilesForEntireResource, // Total number of tiles needed to store the resource
    _Out_opt_ D3D11_PACKED_MIP_DESC* pPackedMipDesc, // Mip packing details
    _Out_opt_ D3D11_TILE_SHAPE* pTileShape, // How pixels fit in tiles, independent of surface dimensions,
                                 // not including packed mip(s). If the entire surface is packed,
                                 // this parameter is meaningless since there is no defined layout
                                 // for packed mips. In this case the returned fields are set to 0.
    _Inout_opt_ UINT* pNumSubresourceTilings, // IN: how many subresources to query tilings for;
                                 // OUT: how many were retrieved (clamped to what's available)
    _In_ UINT FirstSubresourceTilingToGet, // ignored if *pNumSubresourceTilings is 0
    _Out_writes_(*pNumSubresourceTilings) D3D11_SUBRESOURCE_TILING* pSubresourceTilings
                                 // Subresources that are part of packed mips return 0 for all of the fields
                                 // in the corresponding output, except StartTileIndexInOverallResource, which is
                                 // set to D3D11_PACKED_TILE (0xffffffff) - basically indicating that the whole
                                 // struct is meaningless for this case and pPackedMipDesc applies.
);

// --------------------------------------------------------------------------------------------------------------------------------
// CheckMultisampleQualityLevels1
// --------------------------------------------------------------------------------------------------------------------------------
// CheckMultisampleQualityLevels1 is a variant of the existing CheckMultisampleQualityLevels API that adds a flags field that
// allows the caller to indicate the query is for a tiled resource. This allows drivers to report multisample quality levels
// for Tiled Resources differently than non-Tiled Resources.
//
// As with non-Tiled Resources, when Multisampling is supported/required for a given format, applications are guaranteed to
// be able to use the standard or center multisample patterns instead of using one of the driver quality levels.
//
typedef enum D3D11_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAGS
{
    D3D11_CHECK_MULTISAMPLE_QUALITY_LEVELS_TILED_RESOURCE = 0x00000001,
} D3D11_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAGS;

HRESULT ID3D11Device2::CheckMultisampleQualityLevels1(
    _In_ DXGI_FORMAT Format,
    _In_ UINT SampleCount,
    _In_ UINT Flags, // D3D11_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAGS
    _Out_ UINT* pNumQualityLevels );
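A hedged sketch tying GetResourceTiling() to the packed-mip mapping rule described earlier: query the tiling, then map all packed mips of array slice 0 as one flat range (bUseBox must be FALSE for packed areas). MipCount and PackedPoolOffset are application-known values assumed here:

    UINT NumTiles = 0;
    D3D11_PACKED_MIP_DESC Packed;
    D3D11_TILE_SHAPE Shape;
    UINT NumSubresourceTilings = 0; // not retrieving per-subresource tilings in this pass
    pDevice2->GetResourceTiling(pTiledTexture, &NumTiles, &Packed, &Shape,
                                &NumSubresourceTilings, 0, NULL);

    if (Packed.NumPackedMips > 0)
    {
        // Packed mips must be mapped all-or-nothing, so map the whole set at once.
        D3D11_TILED_RESOURCE_COORDINATE Coord = {}; // X=0 offsets into the flat packed range
        Coord.Subresource = MipCount - Packed.NumPackedMips; // first packed mip, slice 0

        D3D11_TILE_REGION_SIZE Size = {};
        Size.bUseBox = FALSE;
        Size.NumTiles = Packed.NumTilesForPackedMips;

        UINT StartOffset = PackedPoolOffset; // tile pool offset chosen by the application
        pDeviceContext2->UpdateTileMappings(pTiledTexture, 1, &Coord, &Size, pTilePool,
                                            1, NULL, &StartOffset, NULL, 0);
    }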
As mentioned, existing methods in D3D for moving data around work with Tiled Resources just as if they are not Tiled, except that writes to unmapped areas are dropped and reads from unmapped areas produce 0. If a copy involves writing to the same memory location multiple times because multiple locations in the destination resource are mapped to the same tile memory, the resulting writes to multi-mapped tiles are nondeterministic/nonrepeatable - accesses happen in whatever order the hardware happens to execute the copy.
This section describes the following additional methods of copying:
(a) between tiles in a Tiled Resource (at 64KB tile granularity) and (to/from) a Buffer in GPU memory (or staging resource) - CopyTiles()
(b) from application provided memory to tiles in a Tiled Resource - UpdateTiles()
These methods swizzle/deswizzle as needed, and allow a D3D11_TILE_COPY_NO_OVERWRITE flag when the caller promises the destination memory is not referenced by GPU work that is in flight.
The tiles involved in the copy cannot include tiles containing packed mipmaps or results are undefined. To transfer data to/from mipmaps that the hardware packs into one tile, the standard (non-tile specific) Copy/Update APIs (or GenerateMips for the whole mip chain) must be used.
Using GenerateMips() on a resource with partially mapped tiles produces results that simply follow the rules for reading and writing NULL-mapped tiles, applied to whatever algorithm the hardware/driver happens to use for GenerateMips(). So it is not particularly useful for an application to bother doing this unless somehow the areas with NULL mappings (and their effect on other mips during the generation phase) have no consequence on the parts of the surface the application does care about.
Copying tile data from a staging surface or from application memory would be the way to upload tiles that may have been streamed off disk, for example. A variation when streaming off disk is uploading some sort of compressed data to GPU memory and then decoding on the GPU. The decode target could be a buffer resource in GPU memory, from which CopyTiles() then copies to the actual Tiled Resource. This copy step allows the GPU to swizzle when the swizzle pattern is not known. Swizzling is not needed if the Tiled Resource itself is a Buffer resource (e.g. as opposed to a Texture).
The memory layout of the tiles in the non-tiled Buffer resource side of the copy is simply linear in memory within 64KB tiles, which the hardware/driver would swizzle/deswizzle per tile as appropriate when transferring to/from a Tiled Resource. For MSAA surfaces, each pixel's samples are traversed in sample-index order before moving to the next pixel. For tiles that are partially filled on the right side (for a surface that has a width not a multiple of tile width in pixels), the pitch/stride to move down a row is the full size in bytes of the number of pixels that would fit across the tile if the tile were full. So there can be a gap between each row of pixels in memory. For specification simplicity, mipmaps smaller than a tile are not packed together in the linear layout. This may seem like a waste of memory space, but as mentioned, copying to mips that the hardware packs together is not allowed via CopyTiles() or UpdateTiles(). The application can just use the generic UpdateSubresource*() or CopySubresource*() APIs to copy small mips individually, though in the case of CopySubresource*() that means the linear memory has to be the same dimension as the Tiled Resource - CopySubresource*() can't copy from a Buffer resource to a Texture2D, for instance.
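A minimal sketch of the addressing implied above, assuming a 32bpp format (128x128-pixel tiles per the earlier table); the helper name and structure are illustrative only:

    // Byte offset of texel (x,y) within the t'th 64KB tile of the linear buffer layout,
    // where x,y are relative to the tile's top-left corner. The row pitch is always a
    // full tile row (128 texels * 4 bytes = 512 bytes), even when the surface only
    // partially fills the tile - which is where the row gaps described above come from.
    UINT64 LinearTexelOffset(UINT t, UINT x, UINT y)
    {
        const UINT TileSizeInBytes = 65536;
        const UINT BytesPerTexel   = 4; // 32bpp
        const UINT RowPitch        = 128 * BytesPerTexel;
        return (UINT64)t * TileSizeInBytes + (UINT64)y * RowPitch + (UINT64)x * BytesPerTexel;
    }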
If a hardware standard swizzle is defined, flags could be added to indicate that the data in the Buffer is to be interpreted in that layout (no swizzle necessary on transfer), though alternative approaches to uploading data may also make sense in that case, such as allowing applications direct access to Tile Pool memory.
Copying operations can be done on an immediate or deferred context.
typedef enum D3D11_TILE_COPY_FLAGS
{
    D3D11_TILE_COPY_NO_OVERWRITE = 0x00000001,
        // D3D11_TILE_COPY_NO_OVERWRITE indicates that the application promises that the GPU
        // is not currently referencing any of the portions of destination memory being written.
    D3D11_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE = 0x00000002,
        // D3D11_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE means copy tile data from the
        // specified buffer location, reading tiles sequentially,
        // to the specified tile region (in x,y,z order if the region is a box),
        // swizzling to optimal hardware memory layout as needed.
        // In this case the source data is pBuffer and the destination is pTiledResource.
    D3D11_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER = 0x00000004,
        // D3D11_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER means copy tile data from the
        // tile region, reading tiles sequentially (in x,y,z order if the region is a box),
        // to the specified buffer location, deswizzling to linear memory layout as needed.
        // In this case the source data is pTiledResource and the destination is pBuffer.
} D3D11_TILE_COPY_FLAGS;

// --------------------------------------------------------------------------------------------------------------------------------
// CopyTiles
// --------------------------------------------------------------------------------------------------------------------------------
// Copy from buffer to tiled resource or vice versa.
void ID3D11DeviceContext2::CopyTiles(
    _In_ ID3D11Resource* pTiledResource,
    _In_ const D3D11_TILED_RESOURCE_COORDINATE* pTileRegionStartCoordinate,
    _In_ const D3D11_TILE_REGION_SIZE* pTileRegionSize,
    _In_ ID3D11Buffer* pBuffer, // Default, dynamic or staging buffer
    _In_ UINT64 BufferStartOffsetInBytes,
    _In_ UINT Flags // D3D11_TILE_COPY_FLAGS
);

// --------------------------------------------------------------------------------------------------------------------------------
// UpdateTiles
// --------------------------------------------------------------------------------------------------------------------------------
// Copy from application memory to tiled resource.
void ID3D11DeviceContext2::UpdateTiles(
    _In_ ID3D11Resource* pDestTiledResource,
    _In_ const D3D11_TILED_RESOURCE_COORDINATE* pDestTileRegionStartCoordinate,
    _In_ const D3D11_TILE_REGION_SIZE* pDestTileRegionSize,
    _In_ const void* pSourceTileData, // caller memory
    _In_ UINT Flags // D3D11_TILE_COPY_FLAGS. The only valid option is D3D11_TILE_COPY_NO_OVERWRITE
                    // (the other flags aren't meaningful here, though by definition the flag
                    // D3D11_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE
                    // is basically what UpdateTiles does, sourcing from application memory).
);
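A hedged usage sketch (resource names assumed): stream a single tile of a tiled texture out to a 64KB staging buffer, deswizzling to the linear layout described above:

    D3D11_TILED_RESOURCE_COORDINATE Coord = {}; // tile (0,0,0) of subresource 0
    D3D11_TILE_REGION_SIZE Size = {};
    Size.NumTiles = 1;
    Size.bUseBox = FALSE;
    pDeviceContext2->CopyTiles(pTiledTexture, &Coord, &Size, pStagingBuffer, 0,
                               D3D11_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER);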
// --------------------------------------------------------------------------------------------------------------------------------
// ResizeTilePool
// --------------------------------------------------------------------------------------------------------------------------------
// Resize a Tile Pool. See Resizing Tile Pools(5.9.2.2.2) for discussion, including specifics about what
// shrinking means.
//
// The new Tile Pool size must be a multiple of 64KB (or 0), otherwise the call returns E_INVALIDARG.
// On out of memory the call returns E_OUTOFMEMORY. For either of these failures, the existing Tile Pool remains unchanged,
// including existing mappings. DXGI_ERROR_DEVICE_REMOVED is the other possible error code. S_OK for success.
//
HRESULT ID3D11DeviceContext2::ResizeTilePool(
    _In_ ID3D11Buffer* pTilePool,
    _In_ UINT64 NewSizeInBytes );
// --------------------------------------------------------------------------------------------------------------------------------
// TiledResourceBarrier
// --------------------------------------------------------------------------------------------------------------------------------
// With Tiled Resources, applications have a lot of freedom to reuse tiles in different resources. Sometimes it may not be clear
// to a device/driver, without unreasonable tracking overhead, that some memory in a Tile Pool that was just written to is
// now being used for reading (so caches may have to be flushed, or a bubble might have to be introduced in the pipeline,
// depending on the timing, in order to generate correct results).
//
// As an example, an application may copy to some tiles in a Tile Pool via one Tiled Resource but then read from the same
// tiles using a different Tiled Resource. This is different from using the same resource object first as a destination for
// copying data and then as a source via ShaderResourceView read (which drivers can already tell must be kept in order).
//
// In full detail, the requirement on an application is as follows: When an application transitions from accessing (reading or
// writing) some location in a Tile Pool with one subresource (e.g. mip slice) to accessing the same memory (read or write) via
// another subresource or a different Tiled Resource, in a way that would not be obvious to drivers (because they do not need to
// bother keeping track of where tiles are being shared), the application must call TiledResourceBarrier after the first access
// to the resource and before the second, different method of access. Calling TiledResourceBarrier isn't required if both
// accesses are reads. The parameters are the TiledResource that was accessed before the Barrier and the TiledResource that will
// be accessed after the Barrier using the same Tile Pool memory. If the resources and subresources involved are the same, the
// API doesn't need to be called, as drivers track hazards at the subresource level on their own, cheaply.
//
// The Barrier call informs the driver that operations issued to the resource before the call must complete before any accesses
// that occur after the call via a different Tiled Resource that shares the same memory.
//
// Either or both of the parameters (before or after the barrier) can be NULL. NULL before the barrier means
// all tiled resource accesses before the barrier that have mappings into the Tile Pool that the resource after the barrier maps to
// must complete before the resource specified after the barrier can be referenced by the GPU. NULL after the barrier means
// that any Tiled Resource accesses after the barrier with mappings into the Tile Pool that the resource before the barrier maps
// to can only be executed by the GPU after accesses to the tiled resource before the barrier are finished. Both NULL means all
// previous tiled resource accesses are complete before any subsequent tiled resource access may proceed (for all Tile Pools).
//
// Either a view pointer, a resource or NULL can be passed for each parameter. Views are allowed both for
// convenience and to allow scoping of the barrier effect to a relevant portion of a resource.
//
// Rendering commands that the driver/hardware can tell are completely independent of the tiled resources identified in this
// call are unconstrained in their order of execution with respect to accesses to the identified tiled resources and the barrier.
// If exploiting reordering could produce visible side effects (given appropriate barriers were specified),
// it is an invalid reordering by the system/hardware.
//
void ID3D11DeviceContext2::TiledResourceBarrier(
    _In_opt_ ID3D11DeviceChild* pTiledResourceOrViewAccessBeforeBarrier,
    _In_opt_ ID3D11DeviceChild* pTiledResourceOrViewAccessAfterBarrier );
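A minimal sketch of when the barrier is required, assuming pTiledResourceA and pTiledResourceB have mappings into the same Tile Pool memory (the names, pData, RowPitch and pSRVofB are illustrative):

    // Write the shared memory through resource A...
    pDeviceContext2->UpdateSubresource(pTiledResourceA, 0, NULL, pData, RowPitch, 0);

    // ...then fence before reading the same memory through resource B.
    pDeviceContext2->TiledResourceBarrier(pTiledResourceA, pTiledResourceB);

    // Subsequent draws may now safely read via B.
    pDeviceContext2->PSSetShaderResources(0, 1, &pSRVofB);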
Tiled Resources can be used in Shader Resource Views, Render Target Views, Depth Stencil Views and Unordered Access Views, as well as some bindpoints where Views aren't used, such as Vertex Buffer bindings. See the list of supported bindings earlier. Copy* operations also work on Tiled Resources.
If multiple tile coordinates in one or more views are bound to the same memory location, reads and writes from different paths to the same memory occur in a nondeterministic/nonrepeatable order of memory accesses.
If all tiles behind a shader's memory access footprint are mapped to unique tiles, behavior on all implementations is identical to accessing a non-tiled surface with the same memory contents.
Behavior for SRV reads that involve non-mapped tiles depends on the level of hardware support - see read behavior in Tiled Resources Feature Tiers(5.9.7) for a breakdown of requirements. The following summarizes the ideal behavior (which Tier 2 requires).
Consider a texture filter operation that reads from a set of texels in an SRV. Texels that fall on non-mapped tiles contribute 0 in all non-missing components of the format, and the default for missing components(19.1.3.3), to the overall filter result alongside the contributions from mapped texels. The texels are all weighted and combined together regardless of whether the data came from mapped or non-mapped tiles.
Some first generation Tier 2 level hardware does not meet this spec requirement and returns the 0-with-defaults value described above as the overall filter result if ANY texels (with nonzero weight) fall on non-mapped tiles. No other hardware will be allowed to miss the requirement to include all (nonzero weight) texels in the filter.
An option was considered to automatically fall back to a coarser mip in some fashion when a filter footprint hits missing tiles, either at the texel level or for the entire fetch. However, there did not seem to be a clear advantage for the cost versus relying on applications to figure out how to avoid or deal with missing tiles on their own.
Behavior of UAV reads and writes depends on the level of hardware support. See overall read and write behavior for Tiled Resources Feature Tiers(5.9.7) for a breakdown of requirements.
Ideal behavior:
Shader operations that read from a non-mapped tile in a UAV return 0 in all non-missing components of the format, and the default for missing components(19.1.3.3).
Shader operations that attempt to write to a non-mapped tile cause nothing to be written to the non-mapped area (while writes to mapped areas proceed). This ideal definition for write handling is not required by Tier 2 - writes to non-mapped tiles may end up in a cache that subsequent reads could pick up.
Behavior of DSV reads and writes depends on the level of hardware support. See overall read and write behavior for Tiled Resources Feature Tiers(5.9.7) for a breakdown of requirements.
Ideal behavior:
If a tile is not mapped in the DepthStencilView, the return value from reading depth is 0, which is then fed into whatever operation(s) are configured for the depth read value. Writes to the missing depth tile are dropped. This ideal definition for write handling is not required by Tier 2 - writes to non-mapped tiles may end up in a cache that subsequent reads could pick up.
Behavior of RTV reads and writes depends on the level of hardware support. See overall read and write behavior for Tiled Resources Feature Tiers(5.9.7) for a breakdown of requirements.
On all implementations it is valid for different RTVs (and the DSV) bound simultaneously to have different areas mapped vs. non-mapped, and to have surface formats of different sizes (which means different tile shapes).
Ideal behavior:
Reads from RenderTargetViews return 0 in all non-missing components of the format, and the default for missing components(19.1.3.3). Writes to RenderTargetViews are dropped. This ideal definition for write handling is not required - writes to non-mapped tiles may end up in a cache that subsequent reads could pick up.
If tiles in the source and dest area of a Copy* operation have duplicated mappings in the copy area that would have overlapped even if both resources were not Tiled Resources and the Copy* call supports overlapping copies, this will behave fine (as if the source is copied to a temp before going to the dest). However if the overlap is not obvious (like the source and dest resources are different but share mappings, or mappings are duplicated over a given surface), then results of the copy operation on the tiles that are shared are undefined.
Copying to a Tiled Resource with duplicated tiles in the destination area produces undefined results in those tiles unless the data itself is identical - the copy may write to the duplicated tiles in an arbitrary order.
Suppose an Unordered Access View on a Tiled Resource has duplicate tile mappings in its area or with other resources bound to the pipeline. Ordering of accesses to these duplicated tiles is undefined if performed by different threads, just as memory accesses to UAVs are unordered in general.
If a Tiled Resource's Tile Mappings have changed, or content in mapped Tile Pool tiles has changed via another Tiled Resource's mappings, and the Tiled Resource is going to be rendered via RenderTargetView or DepthStencilView, the application must Clear (using the fixed function Clear APIs) or fully copy over (using the Copy*/Update* APIs) the tiles that have changed within the area being rendered (mapped or not). Failure of an application to clear/copy in these cases leaves hardware optimization structures for the given RenderTargetView or DepthStencilView stale, and will result in garbage rendering results on some hardware and inconsistency across different hardware. These hidden optimization data structures used by hardware may be local to individual mappings and not visible to other mappings to the same memory.
The ClearView API/DDI supports clearing RenderTargetViews with rects, and for hardware that supports Tiled Resources, ClearView must also support clearing of DepthStencilViews with rects, for depth only surfaces (without stencil). This allows applications to Clear only the necessary area of a surface.
If an application needs to preserve existing memory contents of areas in a Tiled Resource where mappings have changed, it has to work around the Clear requirement, unfortunately. The application can accomplish this by first saving the contents where tile mappings have changed (by copying them to a temporary surface, for example using CopyTiles()), issuing the required Clear, and then copying the contents back. While this accomplishes the task of preserving surface contents for incremental rendering, the downside is that subsequent rendering performance on the surface may suffer because rendering optimizations may be lost.
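A minimal sketch of this save/Clear/restore workaround, assuming pScratch is an application-created buffer large enough for the changed region, Coord/Size describe the affected tiles, and ChangedRect bounds the same area in pixels:

    // Save the old contents of the affected tiles to scratch memory.
    pDeviceContext2->CopyTiles(pTiledRT, &Coord, &Size, pScratch, 0,
                               D3D11_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER);

    // Issue the required Clear, scoped to just the changed area via a rect.
    FLOAT ClearColor[4] = { 0, 0, 0, 0 };
    pDeviceContext2->ClearView(pRTV, ClearColor, &ChangedRect, 1);

    // Restore the saved contents.
    pDeviceContext2->CopyTiles(pTiledRT, &Coord, &Size, pScratch, 0,
                               D3D11_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE);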
If a tile is mapped into multiple Tiled Resources at the same time, and the tile contents are manipulated by any means (render, copy, etc.) via one of the Tiled Resources, then before the same tile is rendered via any other Tiled Resource, the tile must be Cleared first as described above.
Suppose an area in a Tiled Resource is being rendered to and the Tile Pool tiles referenced by the render area are also mapped to from outside the render area (including via other Tiled Resources, at the same time or not). Data rendered to these tiles is not guaranteed to appear correctly when viewed through the other mappings, even though the underlying memory layout is compatible. This is due to optimization data structures some hardware uses that can be local to individual mappings for renderable surfaces, not visible to other mappings to the same memory location. This restriction can be worked around by copying from the rendered mapping to all the other mappings to the same memory that might be accessed (or clearing that memory or copying other data to it if the old contents are no longer needed). While this seems redundant, it makes all other mappings to the same memory correctly understand how to access its contents, and at least the memory savings of having only a single physical memory backing remains intact. Also, note that when switching between using different Tiled Resources that share mappings (unless only reading), the TiledResourceBarrier API must be called in between.
If an area in a Tiled Resource is being rendered to and, within the render area, multiple tiles are mapped to the same Tile Pool location, rendering results are undefined on those tiles.
Suppose multiple Tiled Resources have mappings to the same Tile Pool locations and each resource is used to access the same data. This is only valid if the rules above about hardware optimization structures are followed, appropriate calls to TiledResourceBarrier are made, and the Tiled Resources are compatible with each other. The latter is described here in terms of what makes Tiled Resources sharing tiles incompatible: the use of different surface dimensions or formats, or differences in the presence of RenderTarget or DepthStencil BindFlags on the Resources. Writing to the memory with one type of mapping produces undefined results if subsequently reading or rendering via a mapping from an incompatible Resource. If the other Resource sharing mappings will first be initialized with new data (recycling the memory for a different purpose), that is fine, since data is not bleeding across incompatible interpretations; however, the TiledResourceBarrier API must be called when switching between accessing incompatible mappings like this.
If the RenderTarget or DepthStencil BindFlag is not set on any of the resources sharing mappings with each other, there are far fewer restrictions: as long as the format and surface types (e.g. Texture2D) are the same, tiles can be shared. Some cases of different formats are compatible, such as BC* surfaces and the equivalently sized uncompressed 32 bit or 16 bit per component format, like BC6H and R32G32B32A32. Many 32 bit per element formats can be aliased with R32_* as well (R10G10B10A2_*, R8G8B8A8_*, B8G8R8A8_*, B8G8R8X8_*, R16G16_*) - this has always been allowed for non-Tiled Resources.
Sharing between packed and non-packed tiles is fine if the formats are compatible and the tiles are filled with solid color.
Finally, if nothing is common about the Resources sharing tile mappings except that none have RenderTarget/DepthStencil BindFlags, then only memory filled with 0 can be shared safely - it will appear as whatever 0 decodes to for the definition of the given Resource format (typically just 0).
The texture sampling features described here require Tier(5.9.7) 2 level of Tiled Resources support.
Any instruction that reads and/or writes to a Tiled Resource causes status information to be recorded. This is exposed as an optional extra return value on every resource access instruction that goes into a 32-bit temp register. The contents of the return value are opaque - direct reading by the shader program is disallowed. However dedicated instruction(s) (initially only one) allow status information to be extracted.
The check_access_mapped(22.4.26) instruction interprets the status return from a memory access and indicates whether all data being accessed was mapped in the resource - true (0xFFFFFFFF) or false (0x00000000).
During filter operations, sometimes the weight of a given texel ends up being 0.0. An example is a linear sample with texture coordinates that fall directly on a texel center: 3 other texels (which ones they are can vary by hardware) contribute to the filter, but with 0 weight. These 0 weight texels do not contribute to the filter result at all, so if they happen to fall on NULL tiles they don't count as an unmapped access. The same guarantee applies for texture filters that include multiple mip levels - if the texels on one of the mipmaps are not mapped but the weight on those texels is 0, those texels don't count as an unmapped access.
When sampling from a format that has fewer than 4 components (such as DXGI_FORMAT_R8_UNORM), any texels that fall on NULL tiles result in a NULL mapped access being reported, regardless of which component(s) the shader actually looks at in the result. For example, reading from R8_UNORM and masking the read result in the shader with .gba/.yzw wouldn't appear to need to read the texture at all, but if the texel address is on a NULL mapped tile it still counts as a NULL mapped access.
The shader can check the status and pursue any desired course of action on failure. For example logging 'misses' (say via UAV write) and/or issuing another read clamped to a coarser LOD known to be mapped. It may be useful for an application to track successful accesses as well in order to get a sense of what portion of the mapped set of tiles got accessed.
One complication for logging is that there is no mechanism for reporting the exact set of tiles that would have been accessed. The application can make conservative guesses based on knowing the coordinates it used for access, as well as by using the lod instruction, which returns the hardware's LOD calculation.
Another complication is that lots of accesses will be to the same tiles, so there will be a lot of redundant logging and possibly contention on memory. It could be convenient if the hardware could be given the option to not bother to report tile accesses if they were reported elsewhere before. Perhaps the state of such tracking could be reset from the API (likely at frame boundaries).
To help shaders avoid areas in mipmapped Tiled Resources that are known to be non-mapped, most shader instructions that involve using a Sampler (filtering) have a new mode that allows the shader to pass an additional float32 MinLOD clamp parameter to the texture sample. This value is in the View's mipmap number space, as opposed to the underlying resource.
The hardware performs max(fShaderMinLODClamp,fComputedLOD) in the same place in the LOD calculation where the per-Resource MinLOD clamp occurs (which is also a max()).
If the result of applying the Per-sample LOD clamp and any other LOD clamps defined in the sampler is an empty set, the result is the same out of bounds access result as the per-Resource minLOD clamp: 0 for components in the surface format and defaults for missing components.
The lod instruction (which predates the per-sample minLOD clamp described here) returns both a clamped and unclamped LOD. The clamped LOD return from this lod instruction reflects all clamping including the per-resource clamp, but not a per-sample clamp. Per-sample clamp is controlled/known by the shader anyway, so the shader author can manually apply that clamp to the lod instruction's return value if desired.
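For example, in HLSL (a sketch; gShaderMinLodClamp is a hypothetical application-supplied value, and CalculateLevelOfDetail is the HLSL counterpart of the lod instruction, available in the Pixel Shader):

float clampedLod = gTex.CalculateLevelOfDetail(gSamp, uv); // includes the per-resource clamp
float finalLod   = max(clampedLod, gShaderMinLodClamp);    // manually apply the per-sample clamp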
The following shader instructions include combinations of feedback and/or clamp in addition to their basic operation, followed by instructions that examine the feedback return. If the clamp is used, it is an additional scalar float32 register or immediate operand. If feedback is requested, it comes out in an additional 32-bit scalar register operand that needs to be fed into instruction(s) that interpret feedback.
These instructions can be used on Tiled or non-Tiled Resources for all applicable resource dimensions (Buffer, Texture1D/2D/3D). Non-Tiled Resources always appear to be fully mapped.
The suffix _s indicates mapping status, and _cl indicates LOD clamp.
The following instructions have a mapping status return option [_s] (but no clamp option):
The following instructions have both mapping status [_s] and clamp [_cl] options:
The following instruction examines the status return from any of the above instructions:
Note there is no feedback for memory write instructions like store_uav_*. This could be added if needed, but at this time of design some hardware does not support it.
Applications may choose to manage their own data structures that inform them of what the mappings look like for a Tiled Resource. An example would be a surface that contains a texel holding information about every tile in a Tiled Resource. One might store the first LOD that is mapped at a given tile location. By carefully sampling this data structure in a similar way that the Tiled Resource is intended to be sampled, one might discover the minimum LOD that is fully mapped for an entire texture filter footprint. To help make this process easier, a new general purpose sampler mode is introduced: min/max filtering.
Note there is disagreement among IHVs on the utility of min/max filtering for LOD tracking. It hasn't been proven. However, the feature may be useful for other purposes, such as perhaps the filtering of depth surfaces.
Min/Max Reduction filtering is a mode on Samplers that fetches the same set of texels that a normal texture filter would fetch, but instead of blending the values to produce an answer, it returns the min() or max() of the texels fetched, on a per-component basis (e.g. the min of all the R values, separately from the min of all the G values etc.)
The min/max operations follow D3D arithmetic precision rules. The order of comparisons does not matter.
During filter operations that are not min/max, sometimes the weight of a given texel ends up being 0.0. An example is a linear sample with texture coordinates that fall directly on a texel center - 3 other texels (which ones they are may vary by hardware) contribute to the filter but with 0 weight. For any of these texels that would be 0 weight on a non-min/max filter, if the filter is min/max these texels still do not contribute to the result (and the weights do not otherwise affect the min/max filter operation).
The full list of filter modes is shown in the D3D11_FILTER enum in the Sampler State(7.18.3) section - note the modes with MINIMUM and MAXIMUM in the name.
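For illustration, a minimal sketch of creating a minimum-reduction sampler through the public API (error handling omitted; the device pointer is assumed):

D3D11_SAMPLER_DESC desc = {};
desc.Filter   = D3D11_FILTER_MINIMUM_MIN_MAG_MIP_LINEAR; // min reduction over the linear footprint
desc.AddressU = D3D11_TEXTURE_ADDRESS_CLAMP;
desc.AddressV = D3D11_TEXTURE_ADDRESS_CLAMP;
desc.AddressW = D3D11_TEXTURE_ADDRESS_CLAMP;
ID3D11SamplerState* pMinSampler = nullptr;
pDevice->CreateSamplerState(&desc, &pMinSampler);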
Support for this feature depends on Tier(5.9.7) 2 support for Tiled Resources.
New HLSL syntax is required to support tiled resources in Shader Model 5.0 (allowed only on devices with Tiled Resources support). Each relevant HLSL intrinsic method for tiled resources (see the table below) accepts either one (feedback) or two (clamp and feedback, in this order) additional optional parameters. For example, the Sample method is:

Sample(sampler, location [, offset [, clamp [, feedback] ] ])

The offset, clamp and feedback parameters are optional. Programmers have to specify all optional parameters up to the one they need, which is consistent with the C++ rules for default function arguments. For example, if the feedback status is needed, both the offset and clamp parameters need to be explicitly supplied to Sample, even though they may not be logically needed.
The clamp parameter is a scalar float value. The literal value clamp=0.0f indicates that the clamp operation is not performed. The feedback parameter is a uint variable that can be supplied to the memory-access querying intrinsic CheckAccessFullyMapped. Programmers must not modify or interpret the value of the feedback parameter; however, the compiler does not provide any advanced analysis or diagnostics to detect violations of this rule.
There is one HLSL intrinsic to query the feedback status:

bool CheckAccessFullyMapped(in uint FeedbackVar);

CheckAccessFullyMapped interprets the value of FeedbackVar and returns true if all data being accessed was mapped in the resource; otherwise, CheckAccessFullyMapped returns false.
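A minimal HLSL sketch tying the pieces together (resource bindings and the fallback LOD are hypothetical; note that offset and clamp must be supplied in order to reach the feedback parameter):

Texture2D<float4> gTiledTex : register(t0);
SamplerState      gSamp     : register(s0);

float4 SampleWithResidencyCheck(float2 uv)
{
    uint feedback;
    // offset = (0,0), clamp = 0.0f (no per-sample clamp), feedback requested
    float4 color = gTiledTex.Sample(gSamp, uv, int2(0, 0), 0.0f, feedback);
    if (!CheckAccessFullyMapped(feedback))
    {
        // Miss: re-sample at a coarser LOD assumed here to be fully mapped
        color = gTiledTex.SampleLevel(gSamp, uv, 4.0f);
    }
    return color;
}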
If either the clamp or feedback parameter is present, the compiler emits a variant of the basic instruction. For example, Sample of a tiled resource generates the sample_cl_s instruction. If neither clamp nor feedback is specified, the compiler emits the basic instruction, so that there is no change from the current behavior. The clamp value of 0.0f indicates that no clamp is performed; thus, the driver compiler can further tailor the instruction to the target hardware. If feedback is a NULL register in an instruction, the feedback is unused; thus, the driver compiler can further tailor the instruction to the target architecture. If the HLSL compiler infers that clamp is 0.0f and feedback is unused, the compiler emits the corresponding basic instruction (e.g., sample rather than sample_cl_s).
If a tiled resource access consists of several constituent byte code instructions, e.g., for structured resources, the compiler aggregates the individual feedback values via the OR operation to produce the final feedback value. Therefore, programmers see a single feedback value for such a complex access.
The following is a summary table of the HLSL intrinsic methods changed to support feedback and/or clamp. These all work on tiled and non-tiled resources of all dimensions. Non-tiled resources always appear to be fully mapped.
HLSL Objects | Intrinsic methods with feedback option ((*) - also has clamp option)
[RW]Texture2D, [RW]Texture2DArray, TextureCUBE, TextureCUBEArray | Gather, GatherRed, GatherGreen, GatherBlue, GatherAlpha, GatherCmp, GatherCmpRed, GatherCmpGreen, GatherCmpBlue, GatherCmpAlpha
[RW]Texture1D, [RW]Texture1DArray, [RW]Texture2D, [RW]Texture2DArray, [RW]Texture3D, TextureCUBE, TextureCUBEArray | Sample*, SampleBias*, SampleCmp*, SampleCmpLevelZero, SampleGrad*, SampleLevel
[RW]Texture1D, [RW]Texture1DArray, [RW]Texture2D, Texture2DMS, [RW]Texture2DArray, Texture2DArrayMS, [RW]Texture3D, [RW]Buffer, [RW]ByteAddressBuffer, [RW]StructuredBuffer | Load
This existing DDI includes new options on the MiscFlags parameter:
D3DWDDM1_3DDI_RESOURCE_MISC_TILED : Indicates the resource is tiled. Constraints on when this flag can be used are described elsewhere.
D3DWDDM1_3DDI_RESOURCE_MISC_TILE_POOL : Indicates the resource is a tile pool. Must be a Buffer, with usage DEFAULT. Full constraints described elsewhere.
This existing enum for filter types has new entries for min/max filtering.
typedef enum D3D10_DDI_FILTER
{
    // Bits used in defining enumeration of valid filters:
    // bits [1:0] - mip: 0 == point, 1 == linear, 2,3 unused
    // bits [3:2] - mag: 0 == point, 1 == linear, 2,3 unused
    // bits [5:4] - min: 0 == point, 1 == linear, 2,3 unused
    // bit  [6]   - aniso
    // bits [8:7] - reduction type:
    //              0 == standard filtering
    //              1 == comparison
    //              2 == min
    //              3 == max
    // bit  [31]  - mono 1-bit (narrow-purpose filter)
    D3D10_DDI_FILTER_MIN_MAG_MIP_POINT = 0x00000000,
    D3D10_DDI_FILTER_MIN_MAG_POINT_MIP_LINEAR = 0x00000001,
    D3D10_DDI_FILTER_MIN_POINT_MAG_LINEAR_MIP_POINT = 0x00000004,
    D3D10_DDI_FILTER_MIN_POINT_MAG_MIP_LINEAR = 0x00000005,
    D3D10_DDI_FILTER_MIN_LINEAR_MAG_MIP_POINT = 0x00000010,
    D3D10_DDI_FILTER_MIN_LINEAR_MAG_POINT_MIP_LINEAR = 0x00000011,
    D3D10_DDI_FILTER_MIN_MAG_LINEAR_MIP_POINT = 0x00000014,
    D3D10_DDI_FILTER_MIN_MAG_MIP_LINEAR = 0x00000015,
    D3D10_DDI_FILTER_ANISOTROPIC = 0x00000055,
    D3D10_DDI_FILTER_COMPARISON_MIN_MAG_MIP_POINT = 0x00000080,
    D3D10_DDI_FILTER_COMPARISON_MIN_MAG_POINT_MIP_LINEAR = 0x00000081,
    D3D10_DDI_FILTER_COMPARISON_MIN_POINT_MAG_LINEAR_MIP_POINT = 0x00000084,
    D3D10_DDI_FILTER_COMPARISON_MIN_POINT_MAG_MIP_LINEAR = 0x00000085,
    D3D10_DDI_FILTER_COMPARISON_MIN_LINEAR_MAG_MIP_POINT = 0x00000090,
    D3D10_DDI_FILTER_COMPARISON_MIN_LINEAR_MAG_POINT_MIP_LINEAR = 0x00000091,
    D3D10_DDI_FILTER_COMPARISON_MIN_MAG_LINEAR_MIP_POINT = 0x00000094,
    D3D10_DDI_FILTER_COMPARISON_MIN_MAG_MIP_LINEAR = 0x00000095,
    D3D10_DDI_FILTER_COMPARISON_ANISOTROPIC = 0x000000d5,
    D3DWDDM1_3DDI_FILTER_MINIMUM_MIN_MAG_MIP_POINT = 0x00000100,
    D3DWDDM1_3DDI_FILTER_MINIMUM_MIN_MAG_POINT_MIP_LINEAR = 0x00000101,
    D3DWDDM1_3DDI_FILTER_MINIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT = 0x00000104,
    D3DWDDM1_3DDI_FILTER_MINIMUM_MIN_POINT_MAG_MIP_LINEAR = 0x00000105,
    D3DWDDM1_3DDI_FILTER_MINIMUM_MIN_LINEAR_MAG_MIP_POINT = 0x00000110,
    D3DWDDM1_3DDI_FILTER_MINIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR = 0x00000111,
    D3DWDDM1_3DDI_FILTER_MINIMUM_MIN_MAG_LINEAR_MIP_POINT = 0x00000114,
    D3DWDDM1_3DDI_FILTER_MINIMUM_MIN_MAG_MIP_LINEAR = 0x00000115,
    D3DWDDM1_3DDI_FILTER_MINIMUM_ANISOTROPIC = 0x00000155,
    D3DWDDM1_3DDI_FILTER_MAXIMUM_MIN_MAG_MIP_POINT = 0x00000180,
    D3DWDDM1_3DDI_FILTER_MAXIMUM_MIN_MAG_POINT_MIP_LINEAR = 0x00000181,
    D3DWDDM1_3DDI_FILTER_MAXIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT = 0x00000184,
    D3DWDDM1_3DDI_FILTER_MAXIMUM_MIN_POINT_MAG_MIP_LINEAR = 0x00000185,
    D3DWDDM1_3DDI_FILTER_MAXIMUM_MIN_LINEAR_MAG_MIP_POINT = 0x00000190,
    D3DWDDM1_3DDI_FILTER_MAXIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR = 0x00000191,
    D3DWDDM1_3DDI_FILTER_MAXIMUM_MIN_MAG_LINEAR_MIP_POINT = 0x00000194,
    D3DWDDM1_3DDI_FILTER_MAXIMUM_MIN_MAG_MIP_LINEAR = 0x00000195,
    D3DWDDM1_3DDI_FILTER_MAXIMUM_ANISOTROPIC = 0x000001d5,
    D3D10_DDI_FILTER_TEXT_1BIT = 0x80000000 // Only filter for R1_UNORM format
} D3D10_DDI_FILTER;
typedef struct D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE
{
    // Coordinate values below index tiles (not pixels or bytes).
    UINT X;           // Used for buffer, 1D, 2D, 3D
    UINT Y;           // Used for 2D, 3D
    UINT Z;           // Used for 3D
    UINT Subresource; // Indexes into mips, arrays. Used for 1D, 2D, 3D.
                      // For mipmaps that are packed into a single tile, any subresource
                      // value that indicates any of the packed mips refers to the same tile.
};

typedef struct D3DWDDM1_3DDI_TILE_REGION_SIZE
{
    UINT NumTiles;
    BOOL bUseBox; // TRUE: Uses the width/height/depth parameters below to define the region.
                  //       width*height*depth must match NumTiles above. (While this looks like
                  //       redundant information, the application likely has to know how many
                  //       tiles are involved anyway.)
                  //       The downside to using the box parameters is that one update region
                  //       cannot span mipmaps (though it can span array slices via the depth
                  //       parameter).
                  //
                  // FALSE: Ignores the width/height/depth parameters - NumTiles just traverses
                  //       tiles in the resource linearly across x, then y, then z (as
                  //       applicable), then spilling over mips/arrays in subresource order.
                  //       Useful for just mapping an entire resource at once.
                  //
                  // In either case, the starting location for the region within the resource
                  // is specified as a separate parameter outside this struct.
    UINT   Width;  // Used for buffer, 1D, 2D, 3D
    UINT16 Height; // Used for 2D, 3D
    UINT16 Depth;  // For 3D or arrays. For arrays, advancing in depth skips to the next slice
                   // of the same mip size.
};

typedef enum D3DWDDM1_3DDI_TILE_MAPPING_FLAG
{
    D3DWDDM1_3DDI_TILE_MAPPING_NO_OVERWRITE = 0x00000001,
};

typedef enum D3DWDDM1_3DDI_TILE_RANGE_FLAG
{
    D3DWDDM1_3DDI_TILE_RANGE_NULL = 0x00000001,
    D3DWDDM1_3DDI_TILE_RANGE_SKIP = 0x00000002,
    D3DWDDM1_3DDI_TILE_RANGE_REUSE_SINGLE_TILE = 0x00000004,
};

typedef enum D3DWDDM1_3DDI_TILE_COPY_FLAG
{
    D3DWDDM1_3DDI_TILE_COPY_NO_OVERWRITE = 0x00000001,
    D3DWDDM1_3DDI_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE = 0x00000002,
    D3DWDDM1_3DDI_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER = 0x00000004,
};

typedef enum D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAG
{
    D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_TILED_RESOURCE = 0x00000001,
};
// --------------------------------------------------------------------------------------------
// UpdateTileMappings
// --------------------------------------------------------------------------------------------
// See API - the runtime simply passes through parameters after validating most of them, except
// that it does not validate that tile regions actually fit on the specified resource. The driver
// should ignore individual regions that are invalidly specified and then drop the remainder of
// the call (no need to back out progress so far). The debug runtime validates the parameters
// fully.
//
// Errors are reported via the callback pfnSetErrorCb. Valid errors are out of memory and device
// removed. On out of memory (possible if memory allocation for page table storage fails), tile
// mappings are left in their original state before the call.
//
// If a driver implements command lists and out of memory occurs when executing UpdateTileMappings
// in a command list, the driver must invoke device removed. Applications can avoid this situation
// by only doing update calls that change existing mappings of Tiled Resources within command
// lists (so drivers will not have to allocate page table memory, only change the mapping).
//
// Note that many of the array parameters are optional and take special meaning if NULL, as
// follows:
// If pTiledResourceRegionStartCoordinates is NULL at the API (only allowed if
// NumTiledResourceRegions is 1), the runtime fills in a default coordinate of {0,0,0,0} that is
// passed to the DDI (so the DDI will never see NULL).
// If pTiledResourceRegionSizes is NULL at the DDI, all regions are assumed to be a single tile.
// At the API, if NumTiledResourceRegions is 1, pTiledResourceRegionStartCoordinates is NULL and
// pTiledResourceRegionSizes is NULL, the runtime calls the DDI with pTiledResourceRegionSizes
// filled in to cover the entire resource (so the DDI won't see NULL for
// pTiledResourceRegionSizes in this case).
//
// If pRangeFlags is NULL, all tile ranges have 0 for Range Flags.
// If pRangeTileCounts is NULL, all tile ranges have size 1 tile.
// If pRangeFlags[i] specifies D3DWDDM1_3DDI_TILE_RANGE_NULL or _SKIP, the corresponding entry in
// pTilePoolStartOffsets[i] is ignored, and if the call defines nothing but NULL/SKIPs,
// pTilePoolStartOffsets can be NULL.
//
// At the API, if NumRanges is 1 and pRangeTileCounts is NULL, the runtime automatically fills in
// pRangeTileCounts[0] with the total number of tiles specified by all the Tile Regions.
//
// See the API description for examples of common calling patterns - it might make sense for
// drivers to special-case some of these if it turns out they could be executed more efficiently
// than through the path that handles the most general case.
typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_UPDATETILEMAPPINGS )(
    D3D10DDI_HDEVICE hDevice,
    D3D10DDI_HRESOURCE hTiledResource,
    UINT NumTiledResourceRegions,
    _In_reads_(NumTiledResourceRegions) const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pTiledResourceRegionStartCoordinates,
    _In_reads_opt_(NumTiledResourceRegions) const D3DWDDM1_3DDI_TILE_REGION_SIZE* pTiledResourceRegionSizes,
    D3D10DDI_HRESOURCE hTilePool,
    UINT NumRanges,
    _In_reads_opt_(NumRanges) const UINT* pRangeFlags, // D3DWDDM1_3DDI_TILE_RANGE_FLAG
    _In_reads_opt_(NumRanges) const UINT* pTilePoolStartOffsets,
    _In_reads_opt_(NumRanges) const UINT* pRangeTileCounts,
    UINT Flags // D3DWDDM1_3DDI_TILE_MAPPING_FLAG
    );

// --------------------------------------------------------------------------------------------
// CopyTileMappings
// --------------------------------------------------------------------------------------------
// See API - the runtime simply passes through parameters with minimal validation (it does drop
// the call if the regions don't fit).
//
// Errors are reported via the callback pfnSetErrorCb. Valid errors are out of memory and device
// removed. On out of memory (possible if memory allocation for page table storage fails), tile
// mappings are left in their original state before the call.
//
// If a driver implements command lists and out of memory occurs when executing CopyTileMappings
// in a command list, the driver must invoke device removed. Applications can avoid this situation
// by only doing copy calls that change existing mappings of Tiled Resources within command lists
// (so drivers will not have to allocate page table memory, only change the mapping).
//
typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_COPYTILEMAPPINGS )(
    D3D10DDI_HDEVICE hDevice,
    D3D10DDI_HRESOURCE hDestTiledResource,
    _In_ const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pDestRegionStartCoordinate,
    D3D10DDI_HRESOURCE hSourceTiledResource,
    _In_ const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pSourceRegionStartCoordinate,
    _In_ const D3DWDDM1_3DDI_TILE_REGION_SIZE* pTileRegionSize,
    UINT Flags // D3DWDDM1_3DDI_TILE_MAPPING_FLAG
    );

// --------------------------------------------------------------------------------------------
// CopyTiles
// --------------------------------------------------------------------------------------------
// See API - the runtime simply passes through parameters with minimal validation.
//
// This DDI is not expected to fail (the runtime will not check).
typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_COPYTILES )(
    D3D10DDI_HDEVICE hDevice,
    D3D10DDI_HRESOURCE hTiledResource,
    _In_ const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pTileRegionStartCoordinate,
    _In_ const D3DWDDM1_3DDI_TILE_REGION_SIZE* pTileRegionSize,
    D3D10DDI_HRESOURCE hBuffer, // Default, dynamic or staging buffer
    UINT64 BufferStartOffsetInBytes,
    UINT Flags // D3DWDDM1_3DDI_TILE_COPY_FLAG
    );

// --------------------------------------------------------------------------------------------
// UpdateTiles
// --------------------------------------------------------------------------------------------
// See API - the runtime simply passes through parameters with minimal validation.
//
// This DDI is not expected to fail (the runtime will not check).
typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_UPDATETILES )(
    D3D10DDI_HDEVICE hDevice,
    D3D10DDI_HRESOURCE hDestTiledResource,
    _In_ const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pDestTileRegionStartCoordinate,
    _In_ const D3DWDDM1_3DDI_TILE_REGION_SIZE* pDestTileRegionSize,
    _In_ const VOID* pSourceTileData, // caller memory
    UINT Flags // D3DWDDM1_3DDI_TILE_COPY_FLAG
    );

// --------------------------------------------------------------------------------------------
// TiledResourceBarrier
// --------------------------------------------------------------------------------------------
// See API - the runtime simply passes through parameters with minimal validation.
//
// This DDI is not expected to fail (the runtime will not check).
typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_TILEDRESOURCEBARRIER )(
    D3D10DDI_HDEVICE hDevice,
    D3D11DDI_HANDLETYPE TiledResourceAccessBeforeBarrierHandleType,
    _In_opt_ const VOID* hTiledResourceAccessBeforeBarrier,
    D3D11DDI_HANDLETYPE TiledResourceAccessAfterBarrierHandleType,
    _In_opt_ const VOID* hTiledResourceAccessAfterBarrier
    );

// --------------------------------------------------------------------------------------------
// GetMipPacking
// --------------------------------------------------------------------------------------------
// For a given tiled resource, returns how many mips are packed and how many tiles are needed to
// store all the packed mips. Packed mips include cases where multiple small mips share tile(s)
// and also mips for which a given device cannot use standard tile shapes. It is possible for an
// entire resource to be considered packed.
//
// Applications are not told the tile shapes/layout for packed mips and must simply map all or
// none of the packed tiles if any of the packed mipmaps are to be accessed. Otherwise the
// observed mapping of individual pixels accessed will be undefined - IHV specific.
//
// For array surfaces, the returned values are the counts for a single array slice, given that
// the tile breakdown is identical for the mipmaps of each array slice.
//
// Mipmaps whose pixel dimensions fully fill at least one standard shaped tile in all dimensions
// are not allowed to be considered part of the set of packed mips, otherwise the runtime will
// remove the device on an invalid driver. One example of dimensions that a device can validly
// lump into the packed tiles (meaning the IHV can use its own custom tile breakdown) is a mip
// that is at least one tile wide but less than a tile high. Ideally, though, a device would
// stick with the standard tile breakdown for this case (so the application can manage the tiles
// in a standard way). If a device does need to use a custom tiling, the application is not told
// what the tile breakdown is (only how many tiles are involved in the packing overall), and thus
// loses some freedom.
//
typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_GETMIPPACKING )(
    D3D10DDI_HDEVICE hDevice,
    D3D10DDI_HRESOURCE hTiledResource,
    _Out_ UINT* pNumPackedMips,        // How many mips are packed, for a given array slice,
                                       // including any mips that don't use the standard tile
                                       // shapes. If there is no packing, return 0.
    _Out_ UINT* pNumTilesForPackedMips // How many tiles the packed mips fit into, for a given
                                       // array slice. Ignored if *pNumPackedMips returned 0.
    );

// --------------------------------------------------------------------------------------------
// CheckMultisampleQualityLevels
// --------------------------------------------------------------------------------------------
// Variant of the existing DDI for checking multisample quality level support, with a new Flags
// field (see D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAG above) that allows a tiled
// resource to be specified.
//
typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_CHECKMULTISAMPLEQUALITYLEVELS )(
    D3D10DDI_HDEVICE hDevice,
    DXGI_FORMAT Format,
    UINT SampleCount,
    UINT Flags, // D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAG
    _Out_ UINT* pNumQualityLevels
    );

// --------------------------------------------------------------------------------------------
// ResizeTilePool
// --------------------------------------------------------------------------------------------
// See API - the runtime simply passes through parameters with minimal validation (it does fail
// the API call if the size is not a multiple of the tile size or 0).
//
// Errors are reported via the callback pfnSetErrorCb. Valid errors are out of memory and device
// removed. On out of memory, tile mappings are left in their original state before the call.
//
typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_RESIZETILEPOOL )(
    D3D10DDI_HDEVICE hDevice,
    D3D10DDI_HRESOURCE hTilePool,
    UINT64 NewSizeInBytes
    );
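For orientation, a minimal API-level calling pattern corresponding to the UpdateTileMappings DDI above, sketched with the public D3D11.2 types (the context, tiled texture, and tile pool pointers are hypothetical and assumed to already exist):

// Map the first tile of a tiled texture to tile 0 of an existing tile pool.
D3D11_TILED_RESOURCE_COORDINATE startCoord = {}; // tile (0,0,0), subresource 0
D3D11_TILE_REGION_SIZE regionSize = {};
regionSize.NumTiles  = 1;  // bUseBox is FALSE: linear tile traversal
UINT rangeFlags      = 0;  // 0 == map to the given tile pool offset
UINT poolStartOffset = 0;  // offset in tiles, not bytes
UINT rangeTileCount  = 1;
pContext2->UpdateTileMappings(pTiledTexture, 1, &startCoord, &regionSize,
                              pTilePool, 1, &rangeFlags, &poolStartOffset,
                              &rangeTileCount, 0);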
This section is not part of the requirements for the initial implementation of Tiled Resources - it is for future consideration only.
Texture filtering shader instructions can view Texture2DArray Resources as if all the array slices are arranged in a "quilt"/grid that appears as one surface rather than an array of them.
The term "quilt" is meant to evoke the analogy of a collection of rectangular pieces of fabric that have been stitched together in a grid, but instead of fabric, the pieces are slices of a Texture2DArray.
This enables applications to achieve texture filtering on surfaces that appear far larger than the size limits for individual Texture2D surfaces imposed by D3D.
Ideally, double precision texture coordinate interpolation would be supported, so that precision could be maintained when interpolating and representing normalized coordinate values over surfaces that are too large for float32 precision (D3D's texture size limits are basically already there). However requiring double precision, and furthermore, requiring hardware to support individual surfaces that scale indefinitely in size, is out of scope in the timeframe for this feature.
Any Texture2DArray Resource that is not Multisampled can have a Quilted Shader Resource View created on it. Starting with a Texture2DArray Resource, the following parameters describe how to define a Quilt:
// Descriptor for building a Quilt SRV from a Texture2DArray
typedef struct D3D11_TEX2D_QUILT_SRV
{
    UINT MostDetailedMip;
    UINT MipLevels;
    UINT FirstArraySlice; // First slice to use in the quilt (does this have to be 0?)
    UINT QuiltWidthInArraySlices;
    UINT QuiltHeightInArraySlices;
};

// Array slices are assigned into the Quilt starting from FirstArraySlice
// at the top-left of the Quilt, progressing in row order.
// e.g. if FirstArraySlice is 0, the width is 2 and the height is 2,
// the array slices map to the quilt like this:
//   0 1
//   2 3
An IHV requested constraints on the Quilt Width/Height. One constraint could be that the max QuiltWidthInArraySlices is 32, and the same for Height. These dimensions may also have to be pow2, though the Quilt should at least be allowed to be non-square in ArraySlices.
One observation is that even if Quilt dimensions are constrained to pow2, applications that wish to represent nonPow2 overall surface dimensions (at the texel level) can still pick nonPow2 dimensions for the individual Array slices (all the same).
Either Tiled or non-Tiled Resources can be used for a Quilt SRV, though Tiled Resources will likely be far more practical for managing massive surfaces.
Shaders have to declare the dimension (e.g. Texture2D) of any SRV they access. This applies to Quilted Texture2D SRVs as well (the Quilt property will be part of the dimensionality naming).
Any Shader instruction that involves the texture filtering hardware (e.g. instructions that take a Sampler as a parameter) sees the Quilting on a Quilted Texture2D, but addresses the surface using the same coordinates as if it is a Texture2DArray. That means that the texture coordinates include an integer array slice in addition to the U/V normalized coordinates. The U/V normalized coordinates are relative to the selected array slice. So coordinates in the range [0..1] span the selected array slice, just like a normal Texture2DArray. However U/V coordinates outside [0..1] refer to the appropriate neighboring array slice in the Quilt layout. e.g. a U coordinate of 1.5 indicates the middle of the array slice to the right in the quilt. The texture filtering hardware knows how to navigate the quilt in this fashion for each individual texel that is fetched.
This Quilt traversal ability is similar to the way the texture filtering hardware also understands how to navigate across a TextureCube from face to face.
Hardware derivative calculations do not understand anything about Quilting; they are not able to remap coordinates from different array slices into the same number space.
For hardware derivative calculations (e.g. used in mipmap LOD calculation) to work correctly on Quilted texture coordinates, applications can simply use the same array-slice for all the coordinates in a given primitive (e.g. triangle). If a triangle spans multiple array slices, the coordinates would have to be mapped to the normalized space of any one of the array slices, making use of texture coordinates outside [0..1].
The ability of the filtering hardware to traverse over the Quilt applies to the mipmaps as well.
The number of mipmaps available to a given Array Slice is limited by the dimensions of the individual Array slice. This means that a Quilt Texture2D never has all mipmaps available to it (like a pyramid with the top chopped off). The effective size of the coarsest mipmap in a Quilt is the Quilt dimensions in texels (the 1x1 mip from each Array Slice quilted together).
If an application really needs to model a full mipmap pyramid while using Quilts, it must resort to something like creating a second texture that "caps" the pyramid. The "cap" might overlap one mip level with the Quilt (so linear filtering across mips remains well posed). Then at the time of sampling, the application can choose to sample from either the Quilt texture or the "cap" texture based on the LOD.
When an application is generating mipmap data for a Quilt, it would be incorrect to generate the mipmap chain for each Array Slice's mip chain independently. Instead, the mipmap contents should be calculated as if the Quilt is one huge surface. That is what the texture filtering hardware is assuming.
When falling off an edge of the entire Quilt, the coordinate wraps to the other side of the entire Quilt. The Sampler addressing configuration (wrap/mirror/border etc.) is ignored for Quilts.
This constraint to wrap-only was requested by an IHV. Ideally, all addressing modes available to non-Quilt surfaces (wrap, border, clamp etc.) would operate as expected when sampling off the end of a Quilt.
The resinfo instruction (which reports texture dimensions to the shader) reports the dimensions of a Quilted Texture2D not in terms of the underlying Texture2DArray but rather as if it is a large non-array texture whose width/height span the quilt. The number of mipmaps is of course the same for every array slice as for the entire quilt.
Windows Blue exposes Tiled Resources support in two tiers using caps. In future releases, a new tier may be added including the recommendations listed below.
The CheckFeatureSupport DDI has a query for Tiled Resources support:
This query reports support via a flags bitfield to allow for some amount of future expansion of the caps reporting at the DDI, if needed. The Tier flags are cumulative (if the runtime sees Tier 2 support it assumes Tier 1 support regardless of the flag).
typedef enum D3DWDDM1_3DDI_TILED_RESOURCES_SUPPORT_FLAG
{
    D3DWDDM1_3DDI_TILED_RESOURCES_TIER_1_SUPPORTED = 0x00000001,
    D3DWDDM1_3DDI_TILED_RESOURCES_TIER_2_SUPPORTED = 0x00000002,
} D3DWDDM1_3DDI_TILED_RESOURCES_SUPPORT_FLAG;

// D3DWDDM1_3DDICAPS_D3D11_OPTIONS1
typedef struct D3DWDDM1_3DDI_D3D11_OPTIONS_DATA1
{
    UINT TiledResourcesSupportFlags;
} D3DWDDM1_3DDI_D3D11_OPTIONS_DATA1;
At the API, the Tiers are exposed via CheckFeatureSupport using an enum for the Tiers. Support for Min/Max Filtering is called out as a separate cap since the feature is distinct from Tiled Resources, however the runtime simply sets this capability true for hardware that supports Tier 2 and false for any lower level.
typedef enum D3D11_TILED_RESOURCES_TIER
{
    D3D11_TILED_RESOURCES_NOT_SUPPORTED = 0,
    D3D11_TILED_RESOURCES_TIER_1 = 1,
    D3D11_TILED_RESOURCES_TIER_2 = 2,
} D3D11_TILED_RESOURCES_TIER;

typedef struct D3D11_FEATURE_DATA_D3D11_OPTIONS1
{
    D3D11_TILED_RESOURCES_TIER TiledResourcesTier;
    BOOL MinMaxFiltering;
} D3D11_FEATURE_DATA_D3D11_OPTIONS1;
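For example, an application can query the tier and the min/max filtering cap like this (a sketch; error handling abbreviated, device pointer assumed):

D3D11_FEATURE_DATA_D3D11_OPTIONS1 opts = {};
if (SUCCEEDED(pDevice->CheckFeatureSupport(D3D11_FEATURE_D3D11_OPTIONS1,
                                           &opts, sizeof(opts))) &&
    opts.TiledResourcesTier >= D3D11_TILED_RESOURCES_TIER_2)
{
    // Tier 2 guarantees apply; the runtime also reports opts.MinMaxFiltering == TRUE here.
}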
The CheckMultisampleQualityLevels1 API and corresponding CheckMultisampleQualityLevels DDI now have a flags field that allows the driver to be queried for its level of support for Multisampling on Tiled Resources (which can be different from the level of support for non-tiled resources - the number of Quality Levels, for example).
Chapter Contents
(back to top)
6.1 Features
6.2 Thread Re-entrant Create routines
6.3 Command Lists
6.4 DDI Features and Changes
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
The objectives of the features described in this section are to enable efficient distribution of rendering workload/ overhead in the application, runtime, and driver across multiple CPU cores in D3D11. These architectural changes are designed to allow multithreaded rendering applications to be written without overbearing restrictions, and gain close to the expected efficiency advantages when doing so.
The primary features discussed are:
A separate D3D11 API/DDI spec contains more concrete implementation details about the topics discussed here.
Applications would like to create all object types (most particularly resources and shaders) on different threads simultaneously and in parallel with other rendering threads, especially to enable background or bulk loading/ compiling. D3D11 will continue to rely on shared resources to achieve fully parallel GPU usage or multi-GPU usage, which effectively means only limited resource sharing is available for such scenarios. Lastly, the ability to generate Command Lists also fits in well when trying to leverage multi-core CPUs, as long as each Command List can be built on separate CPU threads. However, Command Lists are still required to be executed by the one thread that is, generally, dedicated as the render thread.
It is important to note that although Command Lists are reusable across frames, the design point for this feature is use-once. Command List creation overhead in the runtime and driver should be low enough that single-use for the sole purpose of distributing work across threads provides a significant performance win. Likewise, the overhead of submitting the Command List in the main rendering thread (immediate context) should be minimized; the design should diminish any need to patch or recompile Command Lists. If multi-use optimizations become interesting, implementations are encouraged to promote such optimizations once a use-threshold has been reached. While the use of a single-use hint flag has been considered, detecting multi-use seems best, to avoid application abuse/ mis-use of hints.
Overview (the names here were chosen to align with kernel concepts to promote quicker understanding, and do not represent the final API or DDI):
The main aspects to notice are: the separation of IDevice from IContext (as IContext is expected to be implemented by two types of Contexts), the concept of a single Immediate Context per Device, the possibility of multiple Deferred Contexts, the Command List object types, and all the new methods that deal with these new objects. Map, Unmap, and GetData are not expected to work on a Deferred Context, while Finalize will not work on the Immediate Context. Further details and options are provided later.
D3D11 allows creation routines to be thread re-entrant, as highlighted in the diagram by grouping such methods on the IDevice interface. This is not accomplished with coarse-grained critical sections. Fine-grained critical-sections are required internally, when necessary. Ideally, no internal synchronization needs to occur; but that is probably not realistic. Not only can one thread be rendering (i.e. calling Draw) while another thread is calling CreateShader; but two threads can be calling CreateShader, while a third thread calls CreateResource, and a fourth is rendering, etc. Due to symmetry, destruction of objects will also be re-entrant. However, the typical destruction of an object goes through multiple stages to keep destruction performant. See Deferred Destruction(6.4.3) for details.
In the D3D10 timeframe, the majority of drivers treated Initial Data passed to the Create functions equivalent to using UpdateSubresource, which is technically a rendering command that naturally presents obstacles for separating creation and rendering. In addition, the UpdateSubresource path would typically force the resource to be faulted into video memory. With changes to the OS kernel, the driver can use the Map/ Unmap path for Initial Data; but this path is unavailable for both Vista and Windows 7. Unfortunately, drivers are required to significantly change their current implementation surrounding this feature, in order to concurrently upload initial data without significantly perturbing the render thread/ frame rate. This is viewed as short-term pain, until the desired kernel changes are available, with an unknown duration for short-term.
Section Contents
(back to chapter)
6.3.1 Overview
6.3.2 Fire and Forget Model, No Feedback
6.3.3 No Context State Inheritance
6.3.4 No Context State Aftermath
6.3.5 Object State Inheritance & Aftermath
6.3.6 Query Interactions
6.3.7 Nested Command Lists
6.3.8 Allow Map Write on Resources with Restriction
6.3.9 Application Immutable, but Patching is Still Required
The concept of a Command List has been around in other graphics APIs, and partially supported by features in previous versions of Direct3D. Instead of immediately executing graphics commands (or giving the impression of such a model), the graphics commands are recorded for execution later. In the overview, the Deferred Context represents the facility to generate Command Lists. Command Lists work well when supporting multi-core CPUs. Command Lists can be generated by separate threads, although they must be manually executed via the render thread using the Immediate Context. The threading model is that a Context (either Immediate or Deferred) cannot be manipulated by more than one CPU thread simultaneously. Two Contexts, however, can be manipulated simultaneously, in parallel with each other, etc. After generation, a Command List can be used multiple times; but cannot be altered by the application explicitly. The interface for a Deferred Context is generally the same as the Immediate Context, with some exceptions. After work has been built up with a Deferred Context, the Command List must be generated by invoking Finalize. By default, Finalize will leave the Deferred Context in a zombie state, waiting for the Deferred Context to be destroyed. However, there will be an option to reset the Deferred Context and allow a new sequence of commands to be recorded, effectively re-creating the Deferred Context. If specialized IContext methods designed for the Immediate Context are invoked off a Deferred Context, they fail; and vice versa.
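The names here were placeholders; in the D3D11 API as shipped, Finalize corresponds to FinishCommandList and Execute to ExecuteCommandList. A rough sketch of the basic pattern (device, context, and resource pointers assumed):

// Worker thread: record commands on a deferred context.
ID3D11DeviceContext* pDeferred = nullptr;
pDevice->CreateDeferredContext(0, &pDeferred);
pDeferred->OMSetRenderTargets(1, &pRTV, nullptr);
pDeferred->Draw(vertexCount, 0);
ID3D11CommandList* pCommandList = nullptr;
pDeferred->FinishCommandList(FALSE, &pCommandList); // FALSE: no deferred-context state restore

// Render thread: submit on the immediate context.
pImmediate->ExecuteCommandList(pCommandList, FALSE); // FALSE: implicit ClearState afterward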
Since a Deferred Context is building up a deferred timeline for the GPU, the CPU must restrict itself to only sending data to the GPU in a fire-and-forget manner. Deferred Contexts cannot get any feedback from the GPU. Therefore, Resources cannot be Mapped for read access, Query data cannot be retrieved, etc. Such operations can only be done by the rendering thread manipulating the Immediate Context, as the GPU is actually able to make forward progress and resolve the dependencies on data that the CPU requires.
State Inheritance refers to the ability of the Command List to inherit the current state of the Immediate Context when executed. No Immediate Context state (such as bound render targets or shaders) can be inherited by the Command List. The state of the Deferred Context always starts out in the default Context state (i.e. equivalent to giving the new Deferred Context ClearState as its first command, or equivalent to the Immediate Context state immediately upon creation).
When a Command List is actually scheduled/ executed on either the Immediate or Deferred Context, the state of the Context (such as bound render targets and shaders) will be altered afterward. The state of the Context will revert to the default Context state (i.e. equivalent to executing ClearState implicitly immediately after Command List execution).
While Command Lists and the Immediate Context state are effectively sheltered from each other, there is a form of Inheritance and Aftermath that needs to occur to make Command Lists useful: Resources and Query contents, etc. When a Command List executes on the Immediate Context, it inherits and can change the global state of objects, such as texture data, constant buffer data, and query data. Therefore it is possible to generate Command Lists that conditionally do different things, with creative use of Predicates and Resource data.
Query data can be generated by Deferred Contexts, just as Render Target data is generated; and Queries can be wrapped around Command List execution. However, there are some problematic cases that need to be handled, assuming the Query syntax remains unchanged.
First, for Queries that have a Beginning and an End, like Predicates, such bracketing must stay local to a particular Context (i.e. Begin & End must occur within same command timeline). It is not possible for a Begin to happen on one Context to be matched with an End on another Context or Command List. For example, problematic cases are exposed when a bracketing is begun in the Immediate Context and ended by a Command List, and vice versa. This is not allowed, and is enforced. If a Command List manipulates a Query (where the corresponding Deferred Context called Begin or End on the Query), the Command List execution will not be allowed on a Context where the same Query has only been Begun. In addition, any Queries that have been Begun in the Deferred Contexts but not Ended, are implicitly Ended by the invocation to Finalize.
Second, when the Command List was being generated, was it assumed that the Command List execution could’ve been wrapped by any of the available Queries? This can be particularly troubling if a Query has hardware bugs related to it and needs some form of emulation. For example, if Blts are being emulated by the 3d pipeline, such operations are specified not to affect certain Queries. To satisfy the specification, the driver could poll any actively monitored counters and subtract off the Blt contribution from Query results. Such driver workarounds are hard to adapt to the Blts that may occur in a Command List. This does have implications on Software Command List implementations (i.e. it may not be known until Command List execution whether a software fallback will be leveraged, meaning the Deferred Context may need to build multiple types of Command Lists).
Command Lists can call Command Lists, i.e. Execute can be called on a Deferred Context. Once Command List usage becomes popular, preventing nested Command Lists would present an obstacle to quickly offloading code from the Immediate Context to a Deferred Context. Reducing the disparity between Deferred Context authoring and Immediate Context authoring, when possible, removes obstacles to Deferred Context usage. Infinite recursion is prevented naturally due to the separation of Command List and Deferred Context (i.e. in order to execute a Command List, the Deferred Context must be Finalized). This also means that nested Command Lists are finalized before they can be called by other Command Lists. There is no limit on the level of Command List indirection; but there is a practical limit on how deep nesting can realistically be tested.
Executing a Command List from a Deferred Context has the same State Aftermath as executing it on the Immediate Context: an implicit ClearState occurs. The Query restrictions that exist between Immediate Context and Deferred Context also exist for nested Command Lists.
The restriction that Deferred Contexts cannot Map any Resource presents an obstacle to quickly offloading code from the Immediate Context to a Deferred Context. Efficiently written software and middleware inevitably use dynamic resources for quick upload to the GPU. Such software would need separate code-paths in order to be Context-agnostic (i.e. run against an Immediate Context or a Deferred Context) if Map were completely disallowed. However, if the first invocation of Map for a Deferred Context is a discard, and all Maps are Write-Only, these resource operations can be captured without conceptual complications. The entire operation can be converted to be analogous to the UpdateSubresource scenario on the same Deferred Context. Reducing the disparity between Deferred Context authoring and Immediate Context authoring, when possible, removes obstacles to Deferred Context usage.
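A sketch of the permitted pattern, using the shipped API names (the buffer and source data are hypothetical; the first Map of the resource on the deferred context must use DISCARD, and access is write-only):

D3D11_MAPPED_SUBRESOURCE mapped = {};
if (SUCCEEDED(pDeferred->Map(pDynamicVB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
{
    memcpy(mapped.pData, vertices, sizeof(vertices)); // write only; never read through pData
    pDeferred->Unmap(pDynamicVB, 0);
}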
For all practical purposes, the application interprets Command Lists as immutable (i.e. constant after creation). However, there are some cases that could require modification of the Command List to some degree behind the scenes. These are forms of Resource renaming, though they are accomplished via different means.
Even if Map were not allowed on the Deferred Context, there are still interactions between Command Lists and discarding Map that require special attention. Imagine this code sequence:
pData = pImmediateContext->Map( pDynamicBuffer, DISCARD );
*pData = 1;
pImmediateContext->Unmap( pDynamicBuffer );

pDeferredContext = pDevice->CreateDeferredContext();
pDeferredContext->CopyResource( pStagingBuffer, pDynamicBuffer );
pDisplayList = pDeferredContext->Finalize();

pData = pImmediateContext->Map( pDynamicBuffer, DISCARD );
*pData = 2;
pImmediateContext->Unmap( pDynamicBuffer );

pImmediateContext->Execute( pDisplayList );
pData = pImmediateContext->Map( pStagingBuffer, 0 );
The contents of the staging Buffer must be 2, not 1.
The following case is similar to Dynamic Buffers. Even though Present is not allowed on the Deferred Context, there are still interactions between Command Lists and Present that require special attention. Present rotates the identities of the back buffers, which naturally must affect any Command List that contains references to the Back Buffers.
Resource read-after-write hazards and other similar issues still need attention. One Command List could be executed which reads from a Resource after another Command List was executed which wrote to the same Resource. It may be feasible to do full pipeline flushes between the Command Lists which are used to achieve multi-CPU-thread parallelism; a dual core will probably only execute one of these Command Lists per frame. But Command Lists which are re-used will have a tendency to be smaller and used many times per frame. Full pipeline flushes may not be acceptable for such Command Lists.
Section Contents
(back to chapter)
6.4.1 Overview
6.4.2 Thread Re-entrant Callback Routines
6.4.3 Deferred Destruction
6.4.4 Context Local Storage Handles
6.4.5 Software Command List Assistance
The need to make certain DDI entry points thread re-entrant implies an increased awareness of threading at the DDI and, naturally, a myriad of changes to keep things efficient and reduce the propensity for bugs. With the increased usage of critical sections come increased chances for deadlocks. For example, in D3D10, there was a well-defined ordering in which critical sections must be acquired and released, to prevent such deadlocks when holding critical sections simultaneously. If semantics of the following type (i.e. whether one component can hold a critical section during the invocation into another component) do not fall out of the general design of the runtime and DDI, then there is an increased burden of documentation and testing. If the API and callbacks can be designed such that the user mode driver needs no internal synchronization, ensuring no deadlocks occur should be much easier.
With multiple threads in the user mode driver at one time, the DDI callbacks must be thread-safe. The DDI callbacks are generally thin wrappers around the thunks provided by DXGI. They isolate the driver from kernel handles and kernel function signatures. The kernel function signatures may change from OS release to OS release. D3D11 DDI callbacks have identical function signatures and functionality as D3D10 DDI callbacks. However, in contrast to D3D10 DDI callbacks, D3D11 DDI callbacks are designed to be free-threaded when used with a driver that supports thread-safe creation. Callbacks used to satisfy creations will need to be thread re-entrant or provide thread re-entrant counterparts. Ideally, D3D11 DDI callbacks would be completely free-threaded, but a few restrictions still remain. One restriction is that only a single thread can be working against an HCONTEXT at a time. Callbacks that use an HCONTEXT are pfnPresentCb, pfnRenderCb, pfnEscapeCb, pfnDestroyContextCb, pfnWaitForSynchronizationObjectCb, and pfnSignalSynchronizationObjectCb. Thus, if more than one thread is calling these callbacks using the same HCONTEXT, they are required to be synchronized. This is quite natural, since these callbacks are likely to be called only from the thread that is manipulating the immediate context. Another restriction is that the callbacks below are required to be invoked during DDI function calls using the same thread that called the DDI:
pfnDeallocateCb deserves special mention, as it is not required to be called before the driver returns from D3D10DDI_DEVICEFUNCS::pfnDestroyResource for the majority of resource types. Since pfnDestroyResource is a free-threaded function, the driver must defer destruction of the object until it can efficiently ensure that no existing immediate context reference remains (i.e. that pfnRenderCb is called before calling pfnDeallocateCb). This applies even to shared resources, or any other HRESOURCE usage that complements pfnAllocateCb; but it does not apply to primaries.
One of the basic tasks of the API is lifetime management of objects and handles. To stay efficient, the API prefers that object and handle destruction is deferred and amortized by default. Typically, deferment means until the GPU is no longer using the object; however, here the term is meant to represent that the CPU is no longer using an object. The API will not immediately delete an object whose ref count drops to 0. Instead, every flush of a command buffer gives the API an amortized opportunity to find those objects whose ref count is 0 and that are no longer bound to the Immediate Context. This list of handles to delete can be provided to the driver to assist with an efficient flush. There may be additional mechanisms to destroy handles to suit all the needs of the API; but the guarantee will still exist that destroyed handles will not be currently bound to any context.
The user mode driver has to manipulate data local to each object/ handle involved, in order to interact with the driver models. For example, allocation lists have to be built up to accompany command buffer submissions. Because all objects are now becoming nearly process-global, modifying data directly associated with these objects would require synchronization. It is more efficient to have an area of memory strongly associated with each object, but also local to a context, allowing CPU thread modification of memory without synchronization. The user mode driver can provide the size required for such memory, to gain efficiency with anything the runtime needs to allocate also.
The runtime provides a default implementation of the Deferred Context that will emulate Command List support. Even if all the API features can be supported directly in hardware, this does help bootstrap a driver faster. In addition, it can possibly be leveraged for debugging.
Chapter Contents
(back to top)
7.1 Instruction Counts
7.2 Common instruction set
7.3 Temporary Storage
7.4 Immediate Constants
7.5 Constant Buffers
7.6 Shader Output Type Interpretation
7.7 Shader Input/Output
7.8 Integer Instructions
7.9 Floating Point Instructions
7.10 Vector vs Scalar Instruction Set
7.11 Uniform Indexing of Resources and Samplers
7.12 Limitations on Flow Control and Subroutine Nesting
7.13 Memory Addressing and Alignment Issues
7.14 Shader Memory Consistency Model
7.15 Shader-Internal Cycle Counter (Debug Only)
7.16 Textures and Resource Loading
7.17 Texture Load
7.18 Texture Sampling
7.19 Subroutines / Interfaces
7.20 Low Precision Shader Support in D3D
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
Full details of the Shader models for each shader stage are provided in dedicated sections elsewhere in the spec. What follows is a discussion of a few general items (not an exhaustive list) that are common to all of the Shader models.
There are no limits on total shader program length or execution time (accounting for loops and subroutines), aside from any limitations in what may be expressed in the shader token format. Clearly longer programs will degrade in performance, but D3D11.3 currently does not specify how steeply performance will degrade relative to program length or execution time given that there are so many variables that might affect performance.
Aside from a few exceptions, the instruction set for all the shader stages is identical. The exceptions are confined to instructions that only make sense in a given Shader unit. For example, the sample instruction computes LOD based on derivatives, so sample and sample_b (sample with LOD bias) are only relevant in the Pixel Shader where derivatives are present, while sample_l (sample at selected LOD) and sample_d (sample with application-provided derivatives) are available in all stages.
Temporary storage is composed of a single Element type, which is a 4-tuple of untyped 32-bit quantities. Temporary storage consists of two classes of storage: registers, which are non-indexed single elements; and arrays, which are indexable 1D arrays of elements. Temporary storage is read/write, and is uninitialized at the start of a Shader execution instance. Reads of temporary storage that has not been previously written within a Shader execution instance return undefined values, but cannot return data outside of the address space of the device context.
Temporary registers are declared(22.3.35) r#, and can be used as a temporary operand in D3D11.3 instructions.
Temporary arrays are declared(22.3.36) as x#[n], where "n" is the array length (indexed with 0..n-1). Temporary arrays must be indexed by an r# scalar, a statically indexed x# scalar, and/or an optional immediate constant (literal), and can have only one level of index nesting (e.g. x0[x1[r0.x+1].x+1] is not legal, but x0[x1[1].x+1] is legal). A temporary array reference, x#[?], can be used as a temporary operand in D3D11.3 instructions (i.e. anywhere an r# can be used). Out of bounds access to x#[?] is undefined, except that data outside the GPU process context is never visible.
The total quantity of temporary storage per Shader execution instance is 4096 elements, which can be utilized in any combination of registers and arrays. i.e. the total number of r# and x# declared must be <= 4096.
Note that the namespaces for r# and x# (the #) are independent. e.g. suppose r2 and x2[5] are declared. They are independent, but together both count as 6 units of storage against the limit of 4096 temporary registers.
To provide a run-time stack, a program allocates a temporary array of a fixed size. The program should provide its own stack bounds checking, e.g., skip calls if the stack push would exceed the array bounds.
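A minimal HLSL sketch of this pattern (names hypothetical): a fixed-size local array compiles to an indexable temporary array (x#), and the bounds check is the shader's own responsibility:

static const uint STACK_SIZE = 64;
uint stack[STACK_SIZE]; // compiles to an indexable temporary array x#
uint sp = 0;
// Push: skip the call if it would overflow the stack.
if (sp < STACK_SIZE)
{
    stack[sp] = returnState;
    sp++;
}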
There is no limit on the total number of times temp registers (the same one or different ones) can appear in a single instruction or in a shader.
For any instruction source argument that is capable of taking a temporary register, it is also permitted to supply a 32-bit immediate scalar or 32-bit immediate 4-vector in the Shader code. At most one source operand per instruction may be specified using an immediate value (having up to 4 components). Immediate scalar values used in indexing of registers can only be used once per indexed operand in an instruction, but these immediate values do not count against the limit of one immediate as a raw source operand. e.g. "add r0, v[1 + r0.x], float4(1.0f,2.0f,3.0f,4.0f)" is valid, since there is only one immediate source operand present (the float4), with the value 1 in the indexing of v[] not counting against the limit.
If a source operand is a Constant Buffer reference (see Constant Buffers below), the reference to a Constant Buffer DOES count against the same limit as immediate values. This allows implementations to provide immediate values through the same hardware path as Constant Buffers if desired. e.g. "add r0, cb0[r1.x], float4(1.0f,2.0f,3.0f,4.0f)" is invalid, since both an immediate value is used as well as a Constant Buffer read in the same instruction.
There is no limit on the total number of times immediate constants can appear in a single instruction or in a shader.
There are 15 slots for ConstantBuffers that can be active per Pipeline stage. Indexing across ConstantBuffers is not permitted. A given ConstantBuffer is accessed as an operand to any Shader operation as if it is an indexable read-only register in the Shader. Unlike other Buffer binding locations in the pipeline, Constant Buffers do not allow Buffer offsets nor custom strides. The stride of the Buffer is assumed to be the Element width of R32G32B32A32_TYPELESS, and the first Element in the Buffer (at Buffer offset zero) is treated as constant element [0] when referenced from the Shader.
In Shader code, just as a t# register is a placeholder for a Texture, a cb# register is a placeholder for a ConstantBuffer at "slot" #. A ConstantBuffer is accessed in a Shader using: cb#[index] as an operand to Shader instructions, where 'index' can be either an r# or statically indexed x# containing a 32-bit unsigned integer, an immediate 32-bit unsigned integer constant, or a combination of the two, added together. e.g. "mov r0, cb3[x3[0].x+6]" represents moving Element 7 from the ConstantBuffer assigned to slot 3 into r0, assuming x3[0].x contains 1.
There is no limit on the total number of constant buffer reads (from any buffer and location in the buffer) that can appear in a single instruction or in a shader.
The declaration of a ConstantBuffer (cb# register) in a Shader statically describes the buffer expected at that slot, including characteristics such as its size, on which the behaviors described below depend.
Out of bounds access to ConstantBuffers returns 0 in all components. Out of bounds behavior is always with respect to the size of the buffer bound at that slot.
If the constant buffer bound to a slot is larger than the size declared in the shader for that slot, implementations are allowed to return incorrect data (not necessarily 0) for indices that are larger than the declared size but smaller than the buffer size.
Fetching from a ConstantBuffer slot with no Buffer present always returns 0 in all components for all indices.
With this set of information, different hardware implementations sporting varying degrees of optimization for ConstantBuffer access may make informed decisions about how to compile access to the ConstantBuffer into Shader code. Compiled shaders never have to be recompiled just because different ConstantBuffers get bound to the Shader, as the necessary characteristics have been statically declared. Runtime validation (at least in debug) will ensure that the Shader code and the sizes of bound ConstantBuffers satisfy the declarations.
The priorities assigned to ConstantBuffers assist hardware in best utilizing any dedicated constant data access paths/mechanisms, if present. There is no guarantee, however, that accesses to ConstantBuffers with higher priority will always be faster than accesses to lower priority ConstantBuffers. It is possible that a higher priority ConstantBuffer could produce slower performance than a lower priority ConstantBuffer, depending on the declared characteristics of the buffers involved. For example, an implementation may have some arbitrarily sized fast constant RAM that is not large enough for a couple of high priority ConstantBuffers that a Shader has declared, but large enough to fit a declared low priority ConstantBuffer. Such an implementation may have no choice but to use the standard (assumed slow) texture load path for the large high priority ConstantBuffers (perhaps tweaking the cache behavior at least), while placing the lowest priority ConstantBuffer into the (assumed fast) constant RAM.
Applications are able to write Shader code that reads constants in whatever pattern and quantity desired, while still allowing different hardware to easily achieve the best performance possible.
In addition to the aforementioned 15 slots for Constant Buffers, every shader program can declare(22.3.4) a single Immediate Constant Buffer with up to 4096 4-vector values. The data is tied to the shader program permanently, but is otherwise accessed by the shader exactly the same way as Constant Buffers.
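As a sketch of what this looks like in shader code (values and register choices illustrative only), the data is embedded in the declaration and then read through the icb placeholder just like a cb# read:

dcl_immediateConstantBuffer { { 1.0, 0.0, 0.0, 1.0 },
                              { 0.0, 1.0, 0.0, 1.0 } }
mov r0, icb[r1.x + 0]   // indexed like a ConstantBuffer, but the data ships inside the shader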
There is no limit on the total number of immediate constant buffer reads (from any location in the buffer) that can appear in a single instruction or in a shader.
The application is given control over the data type interpretation for Shader outputs (i.e. writing raw integer values vs. writing normalized float values) by simply choosing an appropriate format to interpret the output resource's contents as. See the Formats(19.1) section for detail.
Details on Shader input/output registers (indeed all registers) are provided in the sections dedicated to each Shader unit elsewhere in the spec.
One thing in common about input/output registers for all shaders is that if they are declared(22.3.30) to be dynamically indexable from the shader, and the shader indexes them out of the declared range, results are undefined, although no data from outside the GPU process context is ever visible.
Section Contents
(back to chapter)
7.8.1 Overview
7.8.2 Implementation Notes
7.8.3 Bitwise Operations
7.8.4 Integer Arithmetic Operations
7.8.5 Integer/Float Conversion Operations
7.8.6 Integer Addressing of Register Banks
There is a collection of instructions available to Shaders which are dedicated to performing integer arithmetic and bitwise operations. Operands and output registers for integer instructions can be any of the register classes available to the floating point instructions. There is no data type associated with registers; Shader instructions determine how the data stored in registers is interpreted. Integer instructions simply assume that the data being read from operands and written to the destination are all 32-bit values (unsigned or signed 2's complement, depending on the instruction).
Shader register storage is made up of 32-bit*4-component quantities, and integer arithmetic on these registers is required to be performed at full 32-bit precision in all cases.
The bitwise instructions are listed in the Bitwise Instructions(22.11) sub-section of the full instruction listing.
See the Integer Arithmetic Instructions(22.12) sub-section of the full instruction listing.
There is no implicit conversion between floating-point and integer values. Contents of registers are interpreted as floats or ints by the particular instruction being executed. Dedicated instructions allow explicit conversions to be performed, listed in the Type Conversion Instructions(22.13) sub-section of the full instruction listing.
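For example (a sketch; register choices illustrative):

ftoi r0.x, r1.x   // float32 -> signed int32, rounding toward zero
itof r1.y, r0.x   // signed int32 -> float32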
Integer offsets for reads from register banks are available. These offsets must be scalar values (i.e. a select swizzle must be used to select one component of any vector-valued register used as an index) and are considered to be unsigned 32 bit values.
This indexing mechanism applied to indexable x# registers allows compilers to generate stack-like behavior for Shader subroutines.
An example syntax for indexing is:
mov r1, cb7[3+r2.x]
This instruction assumes that an unsigned 32-bit integer value exists in r2.x, and uses that value to offset into ConstantBuffer 7, starting from location 3 in the ConstantBuffer. Thus, if r2.x contains integer value 2, entry 5 of ConstantBuffer 7 would be referenced.
Floating point instructions must follow the D3D11.3 Floating Point Rules(3.1).
A listing of all floating point instructions can be found here(22.10).
Instructions are provided for rounding floating point values to integral floating point values:
round_ne(22.10.14) (nearest-even)
round_ni(22.10.15) (negative-infinity)
round_pi(22.10.16) (positive-infinity)
round_z(22.10.17) (towards zero)
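As a quick illustration of the differences between these modes (a sketch; register choices illustrative):

round_ne r0.x, l(2.5)    // nearest-even: 2.0f (ties round to the even integer)
round_ni r0.y, l(2.5)    // toward negative infinity: 2.0f
round_pi r0.z, l(2.5)    // toward positive infinity: 3.0f
round_z  r0.w, l(-2.5)   // toward zero: -2.0f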
The D3D intermediate language (IL) and register model are 4-vec oriented. Since this does not constrain hardware implementation (vector vs scalar) too much, this convention will carry forward until a good reason to switch paradigms surfaces. It is known that many implementations actually happen to operate on scalars or combinations of layouts even now.
One area where the vector assumption seems to materially impact data organization is the indexing of registers such as inputs or outputs – the indexing happens across registers. If it is important to be able to express cleanly how to index through an array of scalars, this could be an example of an argument for switching the IL to be completely scalar.
Section Contents
(back to chapter)
7.11.1 Overview
7.11.2 Index Range
7.11.3 Constant Buffer Indexing Example
7.11.4 Resource/Buffer Indexing Example
7.11.5 Sampler Indexing Example
7.11.6 Resource Indexing Declarations
Shaders have bindpoint arrays for various classes of read-only input resources: Constant Buffers (cb), Texture/Buffers (t), Samplers (s).
D3D11 allows all of these to be dynamically but uniformly indexed from a shader, whereas previously none of them were indexable.
As with indexing of other types, such as indexable temps (x#), the dynamic index can be either an r# or statically indexed x# containing a 32-bit unsigned integer, an immediate 32-bit unsigned integer constant, or the combination of the two, added together.
The constraint on the indexing of resources or samplers is that the index must be uniform. That is, the computed index must be the same at that point in the lockstep execution of the program for all invocations of the shader within the Draw*() call. If due to flow control, some of the lockstep shader invocations are inactive, the computed index in those shaders is ignored and therefore cannot cause a violation of the uniform indexing constraint on all the active invocations.
The HLSL compiler will enforce this behavior and driver compilers must not break it either. Violations of the uniform indexing constraint would be a result of an HLSL compiler bug or a driver compiler bug only, and in such cases the indexing results are undefined.
Out of bounds resource indexing produces the same result as if accessing a slot with no resource bound.
In particular note that with Constant Buffers, there are 14 API-visible Constant Buffer slots (a couple of other slots are reserved for various purposes). The valid indexing range for Constant Buffers is therefore [0..13], and accesses out of that range behave as if accessing a slot with no Constant Buffer bound.
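For example (a sketch; register choices illustrative), combining this rule with the behavior of unpopulated slots:

mov r0, cb[r1.x][0]   // if r1.x falls outside [0..13], this reads as an unbound slot: r0 = 0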
Out of bounds indexing of the Samplers (s#) results in undefined behavior.
Suppose x3[0].x contains 4 and x4[2].y contains 5. The following mov instruction:
mov r0, cb[x3[0].x+6][x4[2].y+9]
is therefore equivalent to:
mov r0, cb[10][14]
which means read a 32-bit * 4-vector from location [14] in the ConstantBuffer, at ConstantBuffer bind point [10] (0-based counting).
The uniform dynamic indexing of which Constant Buffer to read from is what was not supported previously. Dynamic indexing within the Constant Buffer itself has always been supported.
Suppose x3[0].x contains 4. The following ld instruction:
ld r0, r1, t[x3[0].x+6], texture2D
is equivalent to:
ld r0, r1, t[10], texture2D
Note the "texture2D" at the end is also a new requirement, whereby all ld/sample instructions will indicate which Shader Resource View type is to be sampled.
Suppose x3[0].x contains 4 and x4[2].y contains 5. The following sample instruction:
sample r0, r1, t[x3[0].x+6], s[x4[2].y+9], textureCubeArray
is equivalent to:
sample r0, r1, t[10], s[14], textureCubeArray
Shader declarations from Shader Model 4.x for individual resources, constant buffers and samplers remain the same in Shader Model 5.0. These are particularly informative for parts of shader code that reference these objects directly, just as before.
However, all instructions that reference texture objects (t#) now specify the view dimension (e.g. textureCubeArray) as a literal parameter. This is redundant when indexing is not used, since the up-front declaration of each t# has a view dimension, but useful when indexing is used.
A flow control block is defined as an if(22.7.1) block, loop(22.7.4) block, or switch(22.7.18) block. Flow control blocks can nest up to 64 deep per subroutine (and main). Behavior of flow control instructions beyond this nesting limit is undefined.
Subroutines can nest up to 32 deep. If there are already 32 entries on the return address stack and a "call" is issued, the call is skipped over.
For Typed memory views, the number of components in an address when accessed by a shader instruction is determined by the number of components in the resource dimension. Each address component is an unsigned 32-bit integer element index.
For Raw memory views, the address is a single-component unsigned 32-bit integer byte offset from the beginning of the view. Addresses must be 32-bit aligned. If an unaligned address is specified for an operation involving a write, the entire contents of the UAV(5.3.9) being written, or all of Thread Group Shared Memory (in the Compute Shader(18)) - whichever is being accessed - become undefined. If an unaligned address is specified for an operation involving a read, an undefined result is returned to the shader. It is invalid for implementations to perform the access as if there were no 32-bit alignment constraints.
For Structured memory views, the address is two unsigned 32-bit integer values. The first value is the struct index, and the second value is a byte offset into the struct. The byte offset must be aligned to 32-bits, otherwise the same behavior described for misaligned raw memory access above applies.
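A sketch of these addressing forms in IL-style code (register choices and offsets illustrative; both byte offsets shown are 32-bit aligned):

ld_raw r0.x, l(16), t0.xxxx                  // Raw view: one 32-bit value at byte offset 16
ld_structured r1.xyzw, r2.x, l(8), t1.xyzw   // Structured view: struct index in r2.x, byte offset 8 within the struct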
Each memory access instruction defines its behavior for out of bounds accesses, with distinctions for the memory location being accessed (UAV vs SRV vs Thread Group Shared Memory), and the layout (raw vs structured vs typed). See the documentation of individual instructions for details. The behaviors are similar for similar classes of instructions – e.g. all atomics have the same out of bounds behavior, all immediate atomics (which return a value to a shader) have their own consistent out of bounds access behavior, etc.
Section Contents
(back to chapter)
7.14.1 Intro
7.14.2 Atomicity
7.14.3 Sync
7.14.4 Global vs Group/Local Coherency on Non-Atomic UAV Reads
The memory accesses within the scope of this chapter are those to Unordered Access Views(5.3.9) (UAVs, u#), available to the Compute Shader(18) and Pixel Shader(16), and those to Thread Group Shared Memory (g#), available to the Compute Shader.
The D3D11 Shader Memory Consistency Model is weak/relaxed, as generally understood in existing architectures and literature. Loosely, this means the program author and/or compiler are responsible for identifying all memory and thread synchronization points via some appropriately expressive labeling.
This section outlines how this weak/relaxed Memory Consistency Model appears to function from the point of view of D3D software.
An atomic operation may involve both reading from and then writing to a memory location. Atomic operations apply only to either u# (Unordered Access Views) or g# (Thread Group Shared Memory).
It is guaranteed that when a thread issues an atomic operation on a memory address, no write to the same address from outside the current atomic operation by any thread can occur between the atomic read and write.
If multiple atomic operations from different threads target the same address, the operations are serialized in an undefined order.
Atomic operations do not imply a memory or thread fence. Fence operations (dubbed "sync") are introduced below. If the program author/compiler does not make appropriate use of fences, it is not guaranteed that all threads see the result of any given memory operation at the same time, or in any particular order with respect to updates to other memory addresses.
Atomicity is implemented at 32-bit granularity. If a load or store operation spans more than 32-bits, the individual 32-bit operations are atomic, but not the whole.
Limitation: Atomic operations on Thread Group Shared Memory are atomic with respect to other atomic operations, as well as operations that only perform reads ("load"s). However atomic operations on Thread Group Shared Memory are NOT atomic with respect to operations that perform only writes ("store"s) to memory. Mixing of atomics and stores on the same Thread Group Shared Memory address without thread synchronization and memory fencing between them produces undefined results at the address involved. This limitation arises because some implementations of loads and stores do not honor the locking semantics for implementing atomics. It turns out this has no impact on loads, since they are guaranteed to retrieve a value either before or after an atomic (they will not retrieve partially updated values, given they are all defined at 32-bit quanta). However store operations could find their way into the middle of an atomic operation and thus have their effect possibly lost.
Note that there is no such limitation on atomics to UAV memory; atomic operations on UAV memory are atomic both with respect to other atomic operations as well as loads and stores.
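For illustration, a sketch of the two forms of atomic add (assuming u0 is a Raw UAV and g0 is Thread Group Shared Memory, with 32-bit aligned byte offsets; register choices illustrative):

imm_atomic_iadd r0.x, u0, r1.x, l(1)   // atomic add to a UAV location, original value returned in r0.x
atomic_iadd g0, r2.x, l(1)             // atomic add to Thread Group Shared Memory, nothing returned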
A sync(22.17.7) instruction is included in the Shader IL for Pixel Shader and the Compute Shader.
This provides memory fence semantics at various scopes, and optional thread group synchronization semantics (the latter only applies to the Compute Shader). For details, including some discussion of the implications see the description of the sync(22.17.7) instruction.
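A sketch of the typical pattern in a Compute Shader (register choices illustrative; byte offsets assumed 32-bit aligned):

store_raw g0.x, r0.x, r1.x   // each thread writes its own slot in Thread Group Shared Memory
sync_g_t                     // fence g# accesses and synchronize all threads in the group
ld_raw r2.x, r3.x, g0.xxxx   // now safe to read a value written by another thread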
Typical implementations will have a cache hierarchy to improve read access performance on UAV(5.3.9) accesses. A constraint that some implementations have with the first stage in this cache hierarchy is that, in addition to operating at per-thread-group scope only, the cache does not have an efficient way of being synchronized with writes or atomics that have happened by other thread groups. Such behavior only surfaces as an issue for applications when cross-thread-group communication needs to be performed involving data loads. In this case, the hardware basically needs to know that it must bypass the first stage of caches on loads, reaching out to a more global memory so that the cross-thread-group communication can function. D3D allows applications to specify this cross-thread-group communication intent as follows.
If a Compute Shader(18) thread in a given thread group needs to perform loads of data that was written by atomics or stores in another thread group, the UAV slot where the data resides must be tagged upon declaration in the shader as "globally coherent", so the implementation can ignore the local cache. Otherwise, this form of cross-thread group data sharing will produce undefined results.
Atomic read-modify-write operations do not have this constraint (even though a part of the operation is a read/load), because a byproduct of the hardware honoring atomicity is that the entire system sees the operation, whereas simple loads on some implementations may only go to a local cache that has no knowledge of external updates.
If a UAV is not declared as "globally coherent", it is only "group coherent", which means loads can only see data written by stores and atomics in other threads in the same thread group. The affected hardware knows it can make use of its thread-group specific caching for loads, since writes to the memory only came from the current thread group. A UAV tagged as "globally coherent" is inherently also "group coherent", although the affected hardware would not use its local cache. As such, the "globally coherent" flag should only be specified when necessary.
As a reminder though, to guarantee coherency on UAV accesses on all implementations, not only must shaders make the global vs group scope distinction discussed here upon UAV declaration, but they must also make appropriate use of memory and/or thread barriers ("sync_*" in the IL) as needed within the shader to enforce proper ordering of operations by individual threads as seen by others. In addition, the "sync" operation has options for memory barriers that also distinguish between global vs group scope, but that control is separate from the topic of this section, and may not be exposed until a later time, as discussed in the sync instruction definition.
Back to the issue of global vs group coherency on non-atomic UAV reads: importantly, for many scenarios where cross-thread-group communication or reduction (such as histograms) can be accomplished using only atomic operations (no cross-thread-group loads involved), there is no problem, since atomic operations are implemented by all hardware in a globally coherent way, regardless of whether the UAV has been tagged as "globally coherent" or not.
In the Pixel Shader(16), if a UAV is not declared as "globally coherent", it is only "locally coherent". "Local coherency" is the Pixel Shader’s equivalent of the Compute Shader’s "group coherency", except having scope limited only to a single Pixel Shader invocation. This indicates that the Pixel Shader is not doing any cross-PS-invocation communication involving simple load operations. Note, however, that in the Pixel Shader just like in the Compute Shader, atomic read-modify-write operations are always globally coherent. Indeed it is likely to be rare for a Pixel Shader or perhaps even the Compute Shader to need to declare a UAV as "globally coherent", given that atomic operations, which are always globally coherent, might provide the most practical mechanism for cross-PS-invocation or cross-group operations.
To assist comparisons of algorithms running on GPUs during application development, a cycle counter can be read by shaders. The cycle counter is a 64-bit unsigned integer.
The cycle counter appears as an additional 2*32-bit (64 bits total) input register type that can be declared in any version 5.0+ shader. There are currently no native 64-bit integer arithmetic operations in shaders, although it is simple enough to emulate them. It may be fine for shaders to just look at the low 32 bits of the counter – this can be requested in the shader. Applications may also export the measurements using standard shader outputs for later analysis such as on the CPU.
The counter is an implementation-dependent measure of cycles in the GPU engine, requiring care to interpret it usefully.
For this discussion, consider a shader "invocation" to be a single execution of one shader program from beginning to end. For the Compute Shader however, an "invocation" is a single thread-group’s execution – e.g. the lifespan of the contents of thread-group shared memory.
The initial value of the counter is undefined.
A single reading of the cycle counter is meaningless. But any shader invocation can poll the counter value any number of times.
Computing a delta from cycle counter readings within a shader invocation is meaningful.
Computing a delta from cycle counter readings across separate shader invocations is not meaningful on all hardware. Developers must obtain information directly from IHVs about whether this is meaningful.
The only IHV agnostic approach to interpreting the counters is to limit calculation of deltas to within a given shader invocation, and only make comparisons of deltas within or between shader invocations.
There are plenty of reasons why test runs will execute differently. The obvious one is that execution of a shader can be interrupted by thread switching, so delta measurements will be arbitrarily larger than the number of cycles spent executing instructions in a given thread.
There is no supported way to find out the frequency of the counter. There is no way to correlate this shader internal counter with external timers such as asynchronous time queries. The counter measurements cannot be correlated with measurements on different hardware by other hardware vendors or even necessarily the same vendor.
If a GPU’s speed changes, such as for power saving, there is no way to know this happened, or its effect on cycle measurements.
Beyond these hints about the care needed to interpret the counter, the onus is on developers to research the properties of new hardware designs that may affect measurements.
The HLSL shader compiler and driver compilers must treat reads of the cycle counter as barriers. Instructions can’t be moved across a counter read, and counter reads can’t be merged.
The runtime enforces that shaders using this feature can only be created on a system with the debug layer enabled. The debug layer is not allowed to be redistributed to end-user machines. The point is that shaders that use this counter are not intended to be shipped.
This feature will not be tested on hardware by WHQL, except perhaps simply checking that drivers do not crash. Microsoft will test that the HLSL compiler output is correct.
A new input register, vCycleCounter(22.3.29), can be declared in any version 5_0 (and beyond) shader:
dcl_input vCycleCounter.{x|xy}.
Reading x yields the 32 LSBs of the 64-bit count, and reading y yields the 32 MSBs.
This register can only be used as the source to a mov instruction, e.g. mov r0.w, vCycleCounter.x.
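Putting the pieces together, a sketch of an in-shader delta measurement (register choices illustrative; per the rule above, the counter is only ever sourced by mov instructions, and arithmetic happens on the copies):

dcl_input vCycleCounter.x
mov r10.x, vCycleCounter.x   // first reading (32 LSBs)
// ...the code being measured goes here...
mov r10.y, vCycleCounter.x   // second reading
iadd r10.z, r10.y, -r10.x    // delta; modulo-2^32 arithmetic tolerates low-word wraparound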
Up to 128 Resources (e.g. Buffer, Texture1D/2D/3D/Cube) can be active per Pipeline stage. A Resource binding is a representation of a Resource's base pointer (and other data such as size and pixel layout) and is independent of the samplers.
A texture out of a set of bound textures cannot be selected dynamically via Shader indexing, aside from the uniform indexing of bindpoints described earlier(7.11); however Texture1D/2D/3D resources with an Array dimension > 1, or TextureCube (which has an Array dimension of 6), allow indexing along the array axis from within Shader code.
Textures can only have a single Element format. Likewise, Buffers used as input to Shaders can also only have a single Element format, and have an implied data stride equal to the Element size. A single Buffer (or Texture) can be set to multiple input slots simultaneously, with different Element formats and/or offsets; however, because Buffers bound as Shader inputs have their data stride implied by the Element format, it is not possible to describe "Array-of-Structures" style layouts in Buffers bound as Shader input. This is unlike the Input Assembler Stage, where multiple-element Buffers are permitted, and Element offsets and strides can be defined for Buffers freely.
Data from textures is accessed in shaders via the load (ld) and sample instructions. The ld instruction provides a simple read and (optional) float32 conversion of texture data using integral addresses, while the sample instructions use normalized floating point addressing and perform filtering in addition to the format conversion.
The load operation performs a non-filtered read of resource data. See the ld(22.4.6) instruction definition for details.
Multisample resources can be set as shader inputs, which allows individual samples to be read by the shader. Support for multisample shader reads has the following restrictions:
See ld(22.4.6) and dcl_resource(22.3.12) definitions for details.
Section Contents
(back to chapter)
7.18.1 Overview
7.18.2 Samplers
7.18.3 Sampler State
7.18.4 Normalized-Space Texture Coordinate Magnitude vs. Maximum Texture Size
7.18.5 Processing Normalized Texture Coordinates
7.18.6 Reducing Texture Coordinate Range
7.18.7 Point Sample Addressing
7.18.8 Linear Sample Addressing
7.18.9 Texture Address Processing
This section describes the mechanics of sampling Texture1D/2D/3D/Cube resources using filtering. The simplest form of sampling a texture is point sampling, supported for all data formats; however, more complex filtering operations are only available for some formats, as indicated in the format list in the Formats(19.1) section.
The behaviors described here are obtained via the various sample* instructions, such as sample(22.4.15). See the specs for those instructions for further details that complement this section.
Unless otherwise noted, all texture sampling address operations are performed according to the arithmetic processing rules described in the Basics(3) section.
Texture filtering theory or historical background is NOT provided in this spec.
Note that details of all required texture filtering algorithms are not fully/exactly specified for this version of D3D11.3; the specs below only explicitly define a subset of all filtering features available in D3D11.3.
Samplers identify filtering modes and other sampler state, described below. Samplers are not indexable from within shaders, aside from the uniform indexing of bindpoints described earlier(7.11). There are 16 sampler "slots" per Pipeline stage, to which "Sampler Objects" can be arbitrarily assigned/reassigned.
The state for a sampler is encapsulated in a "sampler object", up to 4096 of which can be created through the API. At the time a sampler object is created, all of its state must be chosen permanently, and can never be changed. These sampler objects can be arbitrarily assigned to any of the 16 "sampler slots" at each of the Shader stages (a single sampler object is allowed to be assigned to multiple sampler slots, even on multiple pipeline stages simultaneously, if desired).
The reason Sampler Objects are statically created, and there is a limit on the number that can be created, is to enable hardware to maintain references to multiple samplers in flight in the Pipeline, without having to track changes or flush the Pipeline, which would be necessary if Sampler Objects were allowed to be edited.
typedef enum D3D11_FILTER
{
    // Bits used in defining enumeration of valid filters:
    // bits [1:0] - mip: 0 == point, 1 == linear, 2,3 unused
    // bits [3:2] - mag: 0 == point, 1 == linear, 2,3 unused
    // bits [5:4] - min: 0 == point, 1 == linear, 2,3 unused
    // bit  [6]   - aniso
    // bits [8:7] - reduction type:
    //              0 == standard filtering
    //              1 == comparison (hence bit [7] alone historically denoted "comparison")
    //              2 == min
    //              3 == max
    // bit  [31]  - mono 1-bit (narrow-purpose filter) [no longer supported in D3D11]
    D3D11_FILTER_MIN_MAG_MIP_POINT                          = 0x00000000,
    D3D11_FILTER_MIN_MAG_POINT_MIP_LINEAR                   = 0x00000001,
    D3D11_FILTER_MIN_POINT_MAG_LINEAR_MIP_POINT             = 0x00000004,
    D3D11_FILTER_MIN_POINT_MAG_MIP_LINEAR                   = 0x00000005,
    D3D11_FILTER_MIN_LINEAR_MAG_MIP_POINT                   = 0x00000010,
    D3D11_FILTER_MIN_LINEAR_MAG_POINT_MIP_LINEAR            = 0x00000011,
    D3D11_FILTER_MIN_MAG_LINEAR_MIP_POINT                   = 0x00000014,
    D3D11_FILTER_MIN_MAG_MIP_LINEAR                         = 0x00000015,
    D3D11_FILTER_ANISOTROPIC                                = 0x00000055,
    D3D11_FILTER_COMPARISON_MIN_MAG_MIP_POINT               = 0x00000080,
    D3D11_FILTER_COMPARISON_MIN_MAG_POINT_MIP_LINEAR        = 0x00000081,
    D3D11_FILTER_COMPARISON_MIN_POINT_MAG_LINEAR_MIP_POINT  = 0x00000084,
    D3D11_FILTER_COMPARISON_MIN_POINT_MAG_MIP_LINEAR        = 0x00000085,
    D3D11_FILTER_COMPARISON_MIN_LINEAR_MAG_MIP_POINT        = 0x00000090,
    D3D11_FILTER_COMPARISON_MIN_LINEAR_MAG_POINT_MIP_LINEAR = 0x00000091,
    D3D11_FILTER_COMPARISON_MIN_MAG_LINEAR_MIP_POINT        = 0x00000094,
    D3D11_FILTER_COMPARISON_MIN_MAG_MIP_LINEAR              = 0x00000095,
    D3D11_FILTER_COMPARISON_ANISOTROPIC                     = 0x000000d5,
    D3D11_FILTER_MINIMUM_MIN_MAG_MIP_POINT                  = 0x00000100,
    D3D11_FILTER_MINIMUM_MIN_MAG_POINT_MIP_LINEAR           = 0x00000101,
    D3D11_FILTER_MINIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT     = 0x00000104,
    D3D11_FILTER_MINIMUM_MIN_POINT_MAG_MIP_LINEAR           = 0x00000105,
    D3D11_FILTER_MINIMUM_MIN_LINEAR_MAG_MIP_POINT           = 0x00000110,
    D3D11_FILTER_MINIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR    = 0x00000111,
    D3D11_FILTER_MINIMUM_MIN_MAG_LINEAR_MIP_POINT           = 0x00000114,
    D3D11_FILTER_MINIMUM_MIN_MAG_MIP_LINEAR                 = 0x00000115,
    D3D11_FILTER_MINIMUM_ANISOTROPIC                        = 0x00000155,
    D3D11_FILTER_MAXIMUM_MIN_MAG_MIP_POINT                  = 0x00000180,
    D3D11_FILTER_MAXIMUM_MIN_MAG_POINT_MIP_LINEAR           = 0x00000181,
    D3D11_FILTER_MAXIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT     = 0x00000184,
    D3D11_FILTER_MAXIMUM_MIN_POINT_MAG_MIP_LINEAR           = 0x00000185,
    D3D11_FILTER_MAXIMUM_MIN_LINEAR_MAG_MIP_POINT           = 0x00000190,
    D3D11_FILTER_MAXIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR    = 0x00000191,
    D3D11_FILTER_MAXIMUM_MIN_MAG_LINEAR_MIP_POINT           = 0x00000194,
    D3D11_FILTER_MAXIMUM_MIN_MAG_MIP_LINEAR                 = 0x00000195,
    D3D11_FILTER_MAXIMUM_ANISOTROPIC                        = 0x000001d5
} D3D11_FILTER;

typedef enum D3D11_TEXTURE_ADDRESS_MODE
{
    D3D11_TEXADDRESS_WRAP       = 1,
    D3D11_TEXADDRESS_MIRROR     = 2,
    D3D11_TEXADDRESS_CLAMP      = 3,
    D3D11_TEXADDRESS_BORDER     = 4,
    D3D11_TEXADDRESS_MIRRORONCE = 5
} D3D11_TEXTURE_ADDRESS_MODE;

typedef struct D3D11_SAMPLER_STATE
{
    D3D11_FILTER Filter;
    D3D11_TEXTURE_ADDRESS_MODE AddressU; // U coordinate address mode
    D3D11_TEXTURE_ADDRESS_MODE AddressV; // V coordinate address mode
    D3D11_TEXTURE_ADDRESS_MODE AddressW; // W coordinate address mode
    float MinLOD;
    float MaxLOD;
    float MipLODBias;    // (-16.0f..15.99f)
    DWORD MaxAnisotropy; // (0 - 16)
    D3D11_COMPARISON_FUNC ComparisonFunction; // for Percentage-Closer filter
    float BorderColor[4]; // R,G,B,A
} D3D11_SAMPLER_STATE;
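As a minimal sketch of how these pieces fit together (field values illustrative only; which fields are honored depends on the Filter choice, as described next), a trilinear, wrapping sampler could be described as:

// Trilinear filtering, wrap addressing on all axes
D3D11_SAMPLER_STATE Sampler;
Sampler.Filter             = D3D11_FILTER_MIN_MAG_MIP_LINEAR;
Sampler.AddressU           = D3D11_TEXADDRESS_WRAP;
Sampler.AddressV           = D3D11_TEXADDRESS_WRAP;
Sampler.AddressW           = D3D11_TEXADDRESS_WRAP;
Sampler.MinLOD             = 0.0f;
Sampler.MaxLOD             = 3.402823466e+38f; // FLT_MAX: no upper LOD clamp
Sampler.MipLODBias         = 0.0f;
Sampler.MaxAnisotropy      = 1; // not relevant unless an ANISOTROPIC filter is chosen
Sampler.ComparisonFunction = D3D11_COMPARISON_ALWAYS; // not relevant: no COMPARISON filter chosen
Sampler.BorderColor[0] = Sampler.BorderColor[1] =
Sampler.BorderColor[2] = Sampler.BorderColor[3] = 0.0f; // not relevant: no BORDER address mode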
See the Sampler Declaration Statement(22.3.34) in the shader instruction reference for a description of which sampler states are honored depending on the choice of Filter setting, and a description of which sampler* instructions in the shader are permitted to reference samplers configured various ways.
The magnitude of normalized-space texture coordinates (allowing for texture tiling) has no effect on the maximum supportable texture dimensions that can be sampled. The only catch is that as the absolute magnitude of a normalized-space texture coordinate gets larger (e.g. large amounts of tiling), floating point dictates that less precision will be available to resolve individual texels in a given tiling of the texture being sampled. Large amounts of tiling of large-dimension textures will yield sampling artifacts where float32 precision becomes inadequate. But separate from this tradeoff, in order to otherwise decouple the magnitude of normalized-space texture coordinates from having any effect on the maximum texture dimension that can be sampled given float32 normalized-space addressing, a range reduction to about [-10...10], depending on the scenario, is applied to the texture coordinates.
Details of this range reduction are described later(7.18.6). The reduction happens before scaling texture coordinates by texture size, conversion to fixed point, and final application of Texture Address modes (CLAMP/MIRROR/WRAP etc.) on texel addresses. The range reduction allows the fixed point representation to not have to dedicate storage for the texture tiling. It is important to note that range reduction is a separate step from applying Texture Address mode (although the particular Texture Address mode affects what type of reduction gets used).
Using range reduction to decouple texture coordinate magnitude from supportable texture size has the following implication: The maximum texture dimension possible to be sampled in D3D11.3 is 2^17. This limit is derived starting with 24 bits of float32 fractional precision for the original texture coordinate, subtracting required subtexel precision (8 bits), and subtracting 1 more bit due to the factor of 2 scaling in the reduced range. Of course, the minimum upper limit for filterable texture dimension required to be exposed by all D3D11.3 implementations is far smaller, at only 16384 (see System Limits(21)).
This section describes in general how to convert a normalized texture coordinate to a texture address. The description is based on sampling a Texture1D, but applies equally to Texture2D and Texture3D (and not TextureCubes).
A normalized texture coordinate (U) maps the range [0, 1] to the range [0, numTexelsU], where numTexelsU is the size of a 1D texture in texels. The process of computing a texture address is as follows:
To limit the number of bits needed to store the texture coordinate in fixed point after conversion from floating point, the range of the normalized texture coordinate is reduced to be within [-10,10], depending on the Address mode. This removes the magnitude of texture tiling from the texture coordinate, while not affecting the behavior of texture address wrap modes. The same address mode handling can be applied to the range-reduced texture coordinate as to the original, producing the same result. The benefit is that the magnitude of texture tiling is not stored in the coordinate at the same time that texture size scaling is performed on the coordinate. This enables a far larger texture coordinate range to be handled cleanly than would otherwise be possible without reduction.
Note that the range reductions applied here in some cases leave a bit of extra padding (up to the [-10,10] mentioned). This padding allows for the fact that after scaling by texture size, the selection of texels for point or linear sample kernels involves picking texel(s) to the left and/or right of the sample location, so coordinates that are not near the boundaries of the addressing mode must not appear as if they are on the boundary. e.g. Consider linear sampling a coordinate that straddles a border when in BORDER mode: this needs to pick up the Border Color for 1/2 of the samples and the interior edge of the texture for the other 1/2. However range reduction cannot just clamp to [0..1) for BORDER mode, because it would make coordinates that fall completely into BORDER territory incorrectly behave as if they straddle the border (picking up some contribution of Border Color and interior). Range reduction has to also allow for immediate texel offsets permitted in shader code. Range reduction does not change expected texture sampling behavior; it just helps keep the sequence of floating point operations on texture coordinates within manageable range.
The following logic describes how normalized texture coordinate range reduction is performed. (This is different from final Texture Address Processing(7.18.9), which happens a couple of steps later, on scaled coordinates that identify texels.)
Given:

float signedFrac(float f) returns (f - round_z(f))  // round_z : "round towards zero"
float frac(float f)       returns (f - round_ni(f)) // round_ni : "round towards negative infinity"

We have:
float ReduceRange(float U, D3D11_TEXTURE_ADDRESS_MODE AddressMode)
{
    switch (AddressMode)
    {
    case D3D11_TEXADDRESS_WRAP:
        // The reduced range is [0, 1)
        return frac(U);
    case D3D11_TEXADDRESS_MIRROR:
        // The reduced range is (-2, 2)
        return signedFrac(U/2) * 2;
    case D3D11_TEXADDRESS_MIRRORONCE:
    case D3D11_TEXADDRESS_CLAMP:
    case D3D11_TEXADDRESS_BORDER:
        // The reduced range is [-10, 10].
        // Each of these modes might use different tightnesses of reduced range,
        // but since there really is no benefit in that, a one-size-fits-all
        // approach is taken here.
        // Note that the range leaves room for the immediate texel-space offsets
        // supported by sample instructions, [-8...7], preventing these offsets
        // from causing texcoords that clearly should be out of range (i.e. in
        // border/clamp region) from falling within range after range reduction.
        // The point is that range reduction does not have an effect on the
        // texels that are supposed to be chosen.
        if (U <= -10) return -10;
        else if (U >= 10) return 10;
        else return U;
    }
    return 0;
}
Note that the amount of padding supported here for MIRRORONCE/CLAMP/BORDER is only feasible for use with point or linear filtering of a texture (a larger kernel becomes more likely to expose the reduced range boundary), including with immediate texel offsets from the shader. Furthermore, complex filters which use point or linear filter taps as building blocks (the key example being Anisotropic Texture Filtering) are perfectly compatible with the specified range reduction. The reason is that such filters choose their "taps" by perturbing normalized texture coordinates (e.g. walking the line of anisotropy in Anisotropic Texture Filtering), and thus each perturbed "tap" individually goes through the range reduction described here before application of the usual Point/Linear Sample Addressing logic and Texture Address Processing described below.
Setting aside how sampler state is configured and how mipmap LOD is chosen, consider simply the task of point sampling an Element from a particular miplevel of a Texture1D, given a scalar floating point texture coordinate in normalized space. In the Texture Coordinate Interpretation(3.3.3) section, there is a diagram illustrating generally how a 1D texture coordinate maps to a texel (not accounting for wrapping). Note from the "Texture Coordinate System" diagram shown that texel corners have integral coordinates in texel-space, and so texel centers are at half-units away from the corners. Point sampling selects the "nearest" texel based on the proximity of texel centers to the texture coordinate (keeping in mind that texel centers are at half-units).
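A pseudocode sketch of that selection, consistent with the half-unit texel center convention above (a sketch, not the normative definition):

int PointSample1D(float U, int numTexelsU)
{
    float scaledU = U * (float)numTexelsU; // normalized space -> texel space
    int t = (int)floor(scaledU);           // texel whose center (t + 0.5) is nearest
    // t is then subject to Texture Address Processing(7.18.9) if it is out of range
    return t;
}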
For Texture2D and Texture3D Resources, the same rules apply independently on the other dimensions.
For TextureCube Resources, the following occurs:
Similar to the previous section, set aside how sampler state is configured and how mipmap LOD is chosen for now, and consider simply the task of linear sampling an Element from a particular miplevel of a Texture1D, given a scalar floating point texture coordinate in normalized space. Linear sampling in 1D selects the nearest two texels to the sample location and weights the texels based on the proximity of the sample location to them.
texelFetch(tFloorU) * wFloorU + texelFetch( tCeilU) * wCeilU
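The terms above can be derived as in the following pseudocode sketch (consistent with the half-unit texel center convention, and using frac() as defined earlier(7.18.6); a sketch, not the normative definition):

float scaledU = U * (float)numTexelsU - 0.5f; // shift so that texel centers land on integers
int   tFloorU = (int)floor(scaledU);
int   tCeilU  = tFloorU + 1;
float wCeilU  = frac(scaledU);  // proximity of the sample location to the tCeilU center
float wFloorU = 1.0f - wCeilU;
// tFloorU and tCeilU are each subject to Texture Address Processing(7.18.9)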
The procedure described above applies to linear sampling of a given miplevel of a Texture2D as well:
texelFetch(tFloorU, tFloorV) * wFloorU * wFloorV +
texelFetch(tFloorU,  tCeilV) * wFloorU * wCeilV  +
texelFetch( tCeilU, tFloorV) * wCeilU  * wFloorV +
texelFetch( tCeilU,  tCeilV) * wCeilU  * wCeilV
Performing linear sampling of a miplevel of a Texture3D Resource extends the concepts described above to fetching of 8 texels.
In the case of a TextureCube, see the section regarding TextureCube Edge and Corner Handling(7.18.12)
The sample* instructions provide texture coordinates in normalized floating point form, such that values in [0..1] range span a given dimension of a texture, and values outside this range fall off the borders of the texture. Later in the filtering process, when individual texels are fetched, if the address is outside the extents of the texture, either the address gets mapped back into range by the texture address mode for each component, or the border-color is used. The texture address mode is defined by the AddressU, AddressV, and AddressW members of D3D11_SAMPLER_STATE.
Consider the moment in the process of sampling of a Texture1D just after picking a particular integer address scaledU to fetch a texel from (details on choosing sample locations described elsewhere for various filter modes). Suppose the texel address scaledU falls off the Texture1D, meaning either (scaledU < 0), or (scaledU > numTexelsU - 1), where numTexelsU is the count of texels in the U dimension of the Texture1D. The following pseudocode describes how the setting on D3D11_SAMPLER_STATE member AddressU gets applied on scaledU:
if ((scaledU < 0) || (scaledU > numTexelsU-1))
{
    switch (AddressU)
    {
    case D3D11_TEXADDRESS_WRAP:
        scaledU = scaledU % numTexelsU;
        if (scaledU < 0) scaledU += numTexelsU;
        break;
    case D3D11_TEXADDRESS_MIRROR:
    {
        if (scaledU < 0) scaledU = -scaledU - 1;
        bool Flip = (scaledU/numTexelsU) & 1;
        scaledU %= numTexelsU;
        if (Flip) // Odd tile
            scaledU = numTexelsU - scaledU - 1;
        break;
    }
    case D3D11_TEXADDRESS_CLAMP:
        scaledU = max(0, min(scaledU, numTexelsU - 1));
        break;
    case D3D11_TEXADDRESS_MIRRORONCE:
        if (scaledU < 0) scaledU = -scaledU - 1;
        scaledU = max(0, min(scaledU, numTexelsU - 1));
        break;
    case D3D11_TEXADDRESS_BORDER:
        // Special case: Instead of fetching from the texture,
        // use the Border Color(7.18.9.1).
        bUseBorderColor = true;
        break;
    default:
        scaledU = 0;
    }
}
For Texture2D and Texture3D, all of the above modes apply to the V and W dimensions independently, based on AddressV and AddressW. If any single dimension selects Border Color, then the Border Color(7.18.9.1) is applied.
Border Color values are defined in the DDI via 4 floating point values (RGBA), in linear space. The Border Color used in filtering is snapped to the precision the hardware performs filtering at for the format.
Note that the only components of the BorderColor used by filtering hardware are the ones present in the resource format description.
For example, suppose the resource format is DXGI_FORMAT_R8_SNORM, and BorderColor is needed during a sample operation. In this case only the RED component of BorderColor is used, along with the appropriate format-specific defaults for the other components. The BorderColor (the red part in this case) is taken as floating-point data and clamped into the range of the format before filtering. In this case, the red part of the BorderColor is clamped to [-1.0f,1.0f] range before being used by the filtering hardware. From this point (entering the filtering hardware) onward, the fact that BorderColor is being used has no more behavioral effect.
Suppose the task at hand is to choose a mipmap level from a Resource, given a floating point LOD value. The choice of mipmap level is based on the particular choice of filter mode in the Sampler State(7.18.3); in which the possible choices are POINT and LINEAR. Anisotropic texture filtering uses LINEAR mipmap selection.
This section describes how LOD is computed as part of sample* instructions involving filtering.
bool ComputeAnisotropicLOD = (SamplerState.Filter == D3D11_FILTER_ANISOTROPIC)
                             && IsTexture2D // includes Texture2D Arrays
// Note: Implementations may choose to perform anisotropic texture
// filtering for TextureCubes as well, however D3D11.3 does not require(7.18.13)
// filtering of TextureCubes to behave any better than tri-linear filtering.
bool ComputeIsotropicLOD = !ComputeAnisotropicLOD
bool Magnifying = (clampedLOD <= 0)
float3 TC.uvw   // the texture coordinate
float3 dX.uvw   // gradient of TC along render target x
float3 dY.uvw   // gradient of TC along render target y
dX.uv = (AxisMajor*dX'.uv - TC'.uv*DerivativeMajorX) / (AxisMajor*AxisMajor)
dY.uv = (AxisMajor*dY'.uv - TC'.uv*DerivativeMajorY) / (AxisMajor*AxisMajor)
if (IsTextureCube)
{
    // multiplying by 0.5f to adjust for TextureCube coordinate system
    dX.uvw = 0.5f * dX.uvw * [NumTexelsAlongCubeSide, NumTexelsAlongCubeSide, 0];
    dY.uvw = 0.5f * dY.uvw * [NumTexelsAlongCubeSide, NumTexelsAlongCubeSide, 0];
}
else
{
    dX.uvw = dX.uvw * [NumTexelsInUDimension, NumTexelsInVDimension, NumTexelsInWDimension];
    dY.uvw = dY.uvw * [NumTexelsInUDimension, NumTexelsInVDimension, NumTexelsInWDimension];
}
Implicit ellipse coefficients:

A = dX.v ^ 2 + dY.v ^ 2
B = -2 * (dX.u * dX.v + dY.u * dY.v)
C = dX.u ^ 2 + dY.u ^ 2
F = (dX.u * dY.v - dY.u * dX.v) ^ 2

Defining the following variables:
p = A - C
q = A + C
t = sqrt(p ^ 2 + B ^ 2)

The new vectors may then be calculated as:
new_dX.u = sqrt(F * (t+p) / (t * (q+t)))
new_dX.v = sqrt(F * (t-p) / (t * (q+t))) * sgn(B)  // The paper says sgn(B*p), which appears to be incorrect.
new_dY.u = sqrt(F * (t-p) / (t * (q-t))) * -sgn(B)
new_dY.v = sqrt(F * (t+p) / (t * (q-t)))

If w is nonzero, as when calculating LOD for a volume map, an orthogonal transformation must be used to calculate a pair of 2-dimensional vectors with the same lengths and inner angle prior to computing the correct Jacobian matrix. The following is the transformation implemented by the reference rasterizer:
orthovec = dX x (dX x dY)
dX' = (|dX|, 0, 0)
dY' = (dot(dY,dX) / |dX|, dot(dY,orthovec) / |orthovec|, 0)

The following caveats also apply:
float lengthX = sqrt(dX.u*dX.u + dX.v*dX.v + dX.w*dX.w)
float lengthY = sqrt(dY.u*dY.u + dY.v*dY.v + dY.w*dY.w)
output.LOD = log2(max(lengthX, lengthY))
// Compute outputs:
// (1) float ratioOfAnisotropy
// (2) float anisoLineDirection
// (3) float LOD
// (For 1D Textures, dX.v and dY.v are 0, so all the
// math below can be simplified.)

float squaredLengthX = dX.u*dX.u + dX.v*dX.v
float squaredLengthY = dY.u*dY.u + dY.v*dY.v
float determinant = abs(dX.u*dY.v - dX.v*dY.u)
bool isMajorX = squaredLengthX > squaredLengthY
float squaredLengthMajor = isMajorX ? squaredLengthX : squaredLengthY
float lengthMajor = sqrt(squaredLengthMajor)
float normMajor = 1.f/lengthMajor

output.anisoLineDirection.u = (isMajorX ? dX.u : dY.u) * normMajor
output.anisoLineDirection.v = (isMajorX ? dX.v : dY.v) * normMajor

output.ratioOfAnisotropy = squaredLengthMajor/determinant

// clamp ratio and compute LOD
float lengthMinor
if ( output.ratioOfAnisotropy > input.maxAniso ) // maxAniso comes from a Sampler state.
{
    // ratio is clamped - LOD is based on ratio (preserves area)
    output.ratioOfAnisotropy = input.maxAniso
    lengthMinor = lengthMajor/output.ratioOfAnisotropy
}
else
{
    // ratio not clamped - LOD is based on area
    lengthMinor = determinant/lengthMajor
}

// clamp to top LOD
if (lengthMinor < 1.0)
{
    output.ratioOfAnisotropy = MAX( 1.0, output.ratioOfAnisotropy*lengthMinor )

    // lengthMinor = 1.0 // This line is no longer recommended for future hardware
    //
    // The commented out line above was part of the D3D10 spec until 8/17/2009,
    // when it was finally noticed that it was undesirable.
    //
    // Consider the case when the LOD is negative (lengthMinor less than 1),
    // but a positive LOD bias will be applied later on due to
    // sampler / instruction settings.
    //
    // With the clamp of lengthMinor above, the log2() below would make a
    // negative LOD become 0, after which any LOD biasing would apply later.
    // That means with biasing, LOD values less than the bias amount are
    // unavailable. This would look blurrier than isotropic filtering,
    // which is obviously incorrect. The output of this routine must allow
    // negative LOD values, so that LOD bias (if used) can still result in
    // hitting the most detailed mip levels.
    //
    // Because this issue was only noticed years after the D3D10 spec was originally
    // authored, many implementations will include a clamp such as commented out
    // above. WHQL must therefore allow implementations that support either
    // behavior - clamping or not. It is recommended that future hardware
    // does not do the clamp to 1.0 (thus allowing negative LOD).
    // The same applies for D3D11 hardware as well, since even the D3D11 specs
    // had already been locked down for a long time before this issue was uncovered.
}

output.LOD = log2(lengthMinor);
biasedLOD = output.LOD + MipLODBias;
biasedLOD = biasedLOD + srcLODBias; // for sample_b only; must be done per pixel
clampedLOD = max(MinLOD, min(MaxLOD, biasedLOD));

The ordering of min/max guarantees that if MinLOD > MaxLOD, then MinLOD takes precedence. These min and max operations follow the Floating Point Rules(3.1), so NaN never gets propagated. A sampler state that specifies NaN for MinLOD or MaxLOD is invalid.
The selection of minification vs magnification occurs after LOD clamping.
Also note the independent Per-Resource Mipmap Clamping(5.8) feature, which is an optional additional clamp on the LOD like MinLOD above but specified at a resource level as opposed to a sample+shader-resource view level.
In some future D3D version, a better definition of magnification should be considered. For one, filtering should take into account the available mipmaps after clamping. Further, perhaps whenever the most detailed available mipmap is read, it should receive magnification filtering, while minification filtering would always be applied to any less detailed mips read in a given filter operation. Thus a given trilinear filter operation could be applying both magnification on one of the mips referenced simultaneously with minification filtering on the other before blending the mips together. This distinction becomes interesting if more compelling magnification filter types are ever introduced, particularly in avoiding discontinuities transitioning between minification and magnification.
Regarding MipLODBias: The valid range for MipLODBias in the sampler and srcLODBias in the sample_b(22.4.16) instruction is (-16.0f...15.99f). An implementation must support sufficient range for the LOD value before the application-defined MinLOD/MaxLOD/MipLODBias/srcLODBias equation above, such that if the calculated LOD before this equation is outside of the internally supported range and gets clamped (prior to applying application-defined MinLOD/MaxLOD), then the MipLODBias part of the equation (given any valid MipLODBias and srcLODBias value) must not cause the LOD to come back into the range that affects mip selection.
TextureCube filtering near Cube edges, where 2x2 (bilinear) filter taps would fall off a face, is required to spill over by one texel row/column to the appropriate adjacent map.
At TextureCube corners, a linear combination of the three relevant samples is required. The ideal (reference) linear combination of the three samples in the corner case is as follows: Imagine flattening out the Cube faces at the corner, yielding 3 texels and a missing one. Apply bilinear weights on this virtual grid of 4 texels, and then divide the weight for the missing texel evenly amongst the 3 other texels. It is alternatively permissible for an implementation to, instead of dividing the weight evenly amongst the 3 other texels, just split the weight of the missing texel across the 2 adjacent texels. However in future versions of D3D, only the reference behavior will be permitted.
Anisotropic texture filtering on a TextureCube does not have specified/required behavior except that it must at least behave no "worse" than tri-linear filtering would.
The application is given control over the return type of texture load instructions (i.e. reading raw integer values vs. reading normalized float values) by simply choosing an appropriate format to interpret the resource's contents as. See the Formats(19.1) section for detail.
For details on comparison filtering, see the sample_c(22.4.19) and sample_c_lz(22.4.20) instructions.
Comparison Filtering is an attempt by D3D11.3 to define a basic building-block filtering operation that is useful for Percentage-Closer Depth Filtering.
D3D9 never officially specced dedicated hardware support for shadow map scenarios. Namely, D3D9 does not spec the ability to bind a depth buffer as a shader input and to sample from it using comparison filtering (also known as "Percentage Closer Filtering"). Even though this never made it into the D3D9 spec, the D3D9 runtime intentionally used loose validation to enable IHVs to align on a convention for how to make the feature work.
In the meantime, the D3D10+ hardware spec added a requirement for supporting binding depth as a texture and for comparison filtering.
As more scenarios arise involving the D3D11+ APIs running on Feature Level 9.x, it finally makes sense to expose the D3D9 shadow buffer support. It turns out this is possible simply by loosening validation on existing API constructs in the D3D11.1+ API for depth buffers and comparison filtering, mapping to the equivalent of the D3D9 convention IHVs had aligned on where applicable.
When Feature Level 9.x is used at the D3D11.1+ API (meaning the D3D9 DDI is used) on a Win8+ driver, regardless of hardware feature level, applications can do the following:
The overbearing validation described above (dropping Draw calls when state is invalid) helps ensure that an application that can get shadows working at Feature Level 9.x will behave the same if the Feature Level is bumped up to 10+ with no code change required.
The reason this feature is limited to Win8+ drivers (regardless of hardware feature level) is to avoid having to test on any old D3D9 hardware that is unlikely to be driven by the D3D11.1 APIs in the first place.
The D3D11.1 runtime maps this shadow scenario to the D3D9 DDI (regardless of hardware feature level) as follows.
This feature was added too late to enforce via hardware conformance kit testing. However all hardware vendors at the time of shipping agreed to support it, and tests are being authored to assist with basic verification (even if not enforced for now).
The D3D11 CheckFeatureSupport() API has a new capability that can be checked: D3D11_FEATURE_D3D9_SHADOW_SUPPORT. This is set to true if the driver is Win8+ (no need to ask the driver anything else).
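A minimal C++ sketch of the check (assuming a valid ID3D11Device* pDevice; the struct and cap names below are from the D3D11.1 API headers):

D3D11_FEATURE_DATA_D3D9_SHADOW_SUPPORT ShadowCaps = {};
HRESULT hr = pDevice->CheckFeatureSupport(
    D3D11_FEATURE_D3D9_SHADOW_SUPPORT, &ShadowCaps, sizeof(ShadowCaps));
BOOL bShadowsWork = SUCCEEDED(hr) &&
    ShadowCaps.SupportsDepthAsTextureWithLessEqualComparisonFilter;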
On the other hand if the D3D11 CheckFeatureSupport() / CheckFormatSupport() APIs are used to query format support on the individual DXGI_FORMAT_* names described here, the runtime will NOT report support for any capabilities specific to the shadow buffer scenario. For example support for using DXGI_FORMAT_R16_UNORM as a texture is not reported on Feature Level 9.1/9.2 (though it is supported on 9.3, independent of the shadow scenario).
Not reporting shadow support on format caps queries was a simplification. It avoids conflicts where this depth scenario allows operations with format names that are not allowed in non-shadow cases, particularly for DXGI_FORMAT_R16_UNORM. It was not worth disambiguating the format caps reporting for this unique case. The bottom line is all an application needs to do is check the D3D11_FEATURE_D3D9_SHADOW_SUPPORT cap described above to know if the entire scenario will work.
During Texture Sampling(7.18), the range required for selecting texels (after scaling normalized texture coordinates by texture size) is at least 2^16. This range is centered around 0.
The amount of subtexel precision required (after scaling texture coordinates by texture size) is at least 8 bits of fractional precision (2^8 subdivisions).
In mipmap selection, after conversion from float, at least 8 bits must represent the integer component of the LOD, and at least 8 bits must represent the fractional component of the LOD (2^8 subdivisions).
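To illustrate the minimum LOD precision requirement, the following sketch (illustrative only, not a mandated implementation) quantizes a computed float LOD into an 8.8 fixed-point value with 2^8 fractional subdivisions:

float lodFloat = 3.7f;                       // LOD computed by the hardware
int lodFixed  = (int)(lodFloat * 256.0f);    // 8.8 fixed point
int lodInt    = lodFixed >> 8;               // integer mip level: 3
float lodFrac = (lodFixed & 0xFF) / 256.0f;  // blend factor: 0.69921875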
See the discussion in the Fixed Point Integers(3.2.4) section on how fixed point numbers should be defined and how it relates to texture coordinate precision.
All of the texture filtering operations in D3D11.3, when being performed on floating point formats (regardless of format width), are required to follow the D3D11.3 Floating Point Rules(3.1), with one exception: When a filter weight of 0.0 is encountered, NaN's or signed zeros may or may not be propagated from the source texture.
Texture filtering operations performed on fixed point formats must be done with at least as much precision as the format.
Here are some general observations about things that can be expected of texture filtering operations.
Sampling from a slot with no texture bound returns 0 in all components.
Section Contents
(back to chapter)
7.19.1 Overview
7.19.2 Differences from 'Real' Subroutines
7.19.3 Subroutines: Non-goals
7.19.4 Subroutines - Instruction Reference
7.19.5 Simple Example
The programmable graphics pipeline has given software developers greatly enhanced flexibility and power. As a result, shader programming has evolved to the point where programmers need to combine multiple code building blocks (i.e. subroutines) on the fly. Current approaches generally cause the static creation of thousands of one-off shaders, each using a particular combination of subroutines to realize a specific effect. The use of flow control and looping can reduce the number of these precompiled combinations, but these techniques have a dramatic effect on the runtime performance of the shader code, and applications are still sensitive to the extra instructions and registers used in common shaders. Furthermore, since the shader programs are "kernels" or inner loops, any extra overhead for trying to reuse the same instruction stream to represent multiple combinations is more noticeable than in more traditional CPU code. The application developer has no way of knowing when it is safe, in regards to performance, to use flow control to mitigate code complexity. This leads to a different performance problem: dealing with thousands of shaders.
The goal of this feature is to allow applications to have a simple, expressive programming model that abstracts away this combinatoric complexity while still achieving the performance of the custom precompiled shaders. To achieve this goal, we move the complexity from the application level to the driver level where hardware-specific knowledge can be utilized to reduce program size and complexity.
To satisfy the performance requirements of inner loop code, the overhead of calling conventions and lost optimizations needs to be addressed. Our method avoids the overhead by using a subroutine model that virtually "inlines" the functions that can be called. This is done by compiling code normally up to a call site, and then compiling all possible callees with the current state of the caller. The functions called would then be optimized for the current register state by mapping inputs and outputs to their current register locations. While this approach increases overall program size, it avoids the cost of both parameter passing and stack save/restore, thereby avoiding the overhead of traditional function calls while preserving runtime flexibility.
The IL ASM has code blocks that act and look like subroutines; there are defined in/out parameters, and registers are all local (in/out/temp/scratch). Some global references remain: textures, constant buffers, and samplers. The main difference from normal subroutines is that each location that can call a subroutine has a declaration describing the call destinations that are possible.
The set of functions to call when executing a given shader program can be changed between draw calls when calling SetShader. When binding the shader program to the pipeline, the list of functions to use is specified. Selecting the set of functions to use between draw calls allows the driver to recalculate the hardware requirements for a specified set of functions. Calculating the true number of registers required for a given "specialization" of a shader provides the combined flexibility of choice at runtime and the performance of a specialized shader.
The primary difference of this approach from "real" subroutines is that at runtime no calling convention is used. Each time a function could be called, a version of the function is emitted to match the caller’s register and other state. Since a new version of the callee is emitted for each location in the caller code that the function is called from, all optimizations used when inlining apply, except that callee code must remain functionally separate from caller code.
Take an example: The main function has an fcall(22.7.19) instruction and that fcall instruction has two function implementations that could be called. When generating the microcode for the program to execute, the code is generated up to the fcall routine and the current state of the registers and other shader state is stored off in "StateBeforeCall". Then code is generated for the first function that can be called starting with the current state of register allocation, scratch registers, etc. Next the current state is restored to StateBeforeCall and the code for the second function is generated. Finally the current state is restored to StateBeforeCall again and the impacts of the outputs of the fcall are applied to the current state, and code generation continues after the fcall.
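In pseudocode, the code generation scheme just described might look like the following (a sketch in the style of this spec's pseudocode; the names are illustrative and not part of any DDI):

GenerateCode(Shader)
{
    ... emit code up to the fcall ...
    StateBeforeCall = CaptureState(); // register allocation, other shader state
    for(each function body, fb, declared for this fcall site)
    {
        // Callee locals are allocated after the caller's live registers,
        // so no save/restore is needed across the boundary.
        EmitBodyForCallSite(fb, StateBeforeCall);
        RestoreState(StateBeforeCall);
    }
    ApplyFcallOutputs(); // fold the fcall's outputs into the current state
    ... continue emitting code after the fcall ...
}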
Limitations in the IL allow each call destination to have a version of a function's microcode emitted using the caller's current register knowledge: the callee's local registers are allocated after the caller's registers, so that no saving/restoring of data is required when crossing the function boundary.
The downside from "real" subroutines is that the amount of code to represent the program can become quite large. No code sharing is done between multiple call sites. If code is larger than the code cache, and the miss latency is not hidden by some other mechanism, then "real" subroutines are very useful. Assuming that the code bloat size is minimal (i.e. each function is only ever called from one location), then performance will be better with the new method – no parameter passing overhead, inlining optimizations, etc.
Another problem with the new method is that all destinations must be known at compile time. Due to validation that is currently done, all call destinations need to be known. As that requirement is relaxed, "real" subroutines become a better way of handling late-binding destinations.
HLSL requires that all texture and sampler parameters be rooted in some well-known global object so that the compiler can determine which texture or sampler index to use for a particular texture or sampler variable throughout the entire program. As fcalls constitute a late-binding boundary, the compiler cannot easily track parameter identity, and thus texture and sampler arguments to fcalls are not allowed. Note that when only concrete classes are used this isn't a problem. Additionally, texture and sampler members of classes are allowed; this limitation applies only to parameters of interface methods that are used with full fcall dispatch.
Also see the related topics Uniform Indexing of Resources and Samplers(7.11) as well as the this[](22.7.20) register.
interface Light
{
    float3 Calculate(float3 Position, float3 Normal);
};

class AmbientLight : Light
{
    float3 Calculate(float3 Position, float3 Normal)
    {
        return AmbientValue;
    }
    float3 AmbientValue;
};

class DirectionalLight : Light
{
    float3 Calculate(float3 Position, float3 Normal)
    {
        float3 LightDir = normalize(Position - LightPosition);
        float LightContrib = saturate( dot( Normal, -LightDir) );
        return LightColor * LightContrib;
    }
    float3 LightPosition;
    float3 LightColor;
};

AmbientLight MyAmbient;
DirectionalLight MyDirectional;

float4 main(Light MyInstance,
            float3 CurPos : CurPosition,
            float3 Normal : Normal) : SV_Target
{
    float4 Ret;
    Ret.xyz = MyInstance.Calculate(CurPos, Normal);
    Ret.w = 1.0;
    return Ret;
}
// Function table for AmbientLight.
dcl_function_body fb0
dcl_function_table ft0 = { fb0 }
// Function table for DirectionalLight.
dcl_function_body fb1
dcl_function_table ft1 = { fb1 }
// main's MyInstance parameter.
dcl_interface fp0[1][1] = { ft0, ft1 };

// main shader code
// call AmbientLight or DirectionalLight based on function pointer bound
fcall fp0[0][0]
mov o0.xyz, r0.xyzx
mov o0.w, l(1.000000)
ret

// AmbientLight::Calculate
label fb0
mov r0.w, this[0].y
mov r1.x, this[0].x
mov r0.xyz, cb[r1.x + 0][r0.w + 0].xyzx
ret

// DirectionalLight::Calculate
label fb1
mov r0.w, this[0].y
mov r1.xyz, this[0].xyxx
add r1.yzw, v0.xxyz, -cb[r1.z + 0][r1.y + 0].xxyz
dp3 r2.x, r1.yzwy, r1.yzwy
rsq r2.x, r2.x
mul r1.yzw, r1.yyzw, r2.xxxx
dp3_sat r1.y, v1.xyzx, -r1.yzwy
mul r1.xyz, r1.yyyy, cb[r1.x + 0][r0.w + 1].xyzx
mov r0.xyz, r1.xyzx
ret
// Create the shader, and specify the class linkage to load
// class instance info into.
pDevice->CreatePixelShader(pShaderCode, pMyClassLinkage, &pMyPS);

// Get a handle to the MyDirectional and MyAmbient class instances
// from the class linkage.
// The zero is an array index for when the variable is an array.
pMyClassLinkage->GetClassInstance(L"MyDirectional", 0, &pMyDirectionalLight);
pMyClassLinkage->GetClassInstance(L"MyAmbient", 0, &pMyAmbientLight);

while (true)
{
    // Select either the MyDirectional or MyAmbient class instance.
    if (DirectionalLighting)
        pDevice->PSSetShader(pMyPS, &pMyDirectionalLight, 1);
    else
        pDevice->PSSetShader(pMyPS, &pMyAmbientLight, 1);
    RenderScene();
}
The programming model for subroutines is an interface driven model. The interface provides the definition of the function tables that can be switched between efficiently. A level of data abstraction is also present to allow for swapping of both data and function pointers during SetShader calls. At SetShader time, an array of class instantiations is specified that correspond to the interfaces that are used by the shader. The shader reflection system specifies information for each entry in the required interface array. A runtime reflection API is required to be able to specify the class instance in a way that can be efficiently mapped by the runtime to function pointers for the driver calls to consume. The runtime API does not need to be complex, just a method of providing handles to class instances.
The runtime API has only one goal: Provide a handle to SetShader that can be efficiently used to specify to the driver what functions should be executed for a given shader bind. To achieve this goal, a collection of class information is required if the class instance handles are to be shared across multiple shaders, i.e. between all shaders within an effect. When a shader is created, an ID3D11ClassLinkage is a new parameter that specifies where to add the class metadata. If the same class linkage is specified for two shaders, then the same class instance handles are used when binding either shader. The collection of class metadata could be global to a given device, but that could become cumbersome when mixing large collections of shaders (i.e. keeping one middleware solution separate from another).
interface ID3D11ClassLinkage : IUnknown
{
    // PRIMARY FUNCTION - get a reference to an instance of a class
    // that exists in a shader. The common scenario is to refer to
    // variables declared in shaders, which means that a reference is
    // acquired with this function and then passed in on SetShader.
    HRESULT GetClassInstance(
        WCHAR *pszClassInstanceName,
        UINT uInstanceIndex,
        ID3D11ClassInstance **pClassInstance);

    // Create a class instance reference that is the combination of a class
    // type and the location of the data to use for the class instance
    // - not the common scenario, but useful in case the data location
    // for a class is dynamic or not known until runtime.
    HRESULT CreateClassInstance(
        WCHAR *pszClassTypeName,
        UINT ConstantBufferOffset,
        UINT ConstantVectorOffset,
        UINT TextureOffset,
        UINT SamplerOffset,
        ID3D11ClassInstance **pClassInstance);
}

// Specifying the calls in "10 speak". Use the following as an example
// of how one could retrofit D3D10 and then put that into the D3D11 API,
// i.e. ignoring the split of Creates off of the device, new stages, etc.
interface ID3D11Device
{
    [ … Existing calls … ]

    // Shader create calls take a parameter to specify the class linkage
    // to append the class symbol information from the shader into.
    // This is a NON-OPTIONAL parameter. A shader is unusable without
    // the function table information being used (assuming it has any).
    HRESULT CreateVertexShader(
        void *pShaderBytecode,
        SIZE_T BytecodeLength,
        ID3D11ClassLinkage *pClassLinkage,
        ID3D11VertexShader **ppVertexShader);
    HRESULT CreateGeometryShader(
        void *pShaderBytecode,
        SIZE_T BytecodeLength,
        ID3D11ClassLinkage *pClassLinkage,
        ID3D11GeometryShader **ppGeometryShader);
    HRESULT CreatePixelShader(
        void *pShaderBytecode,
        SIZE_T BytecodeLength,
        ID3D11ClassLinkage *pClassLinkage,
        ID3D11PixelShader **ppPixelShader);
    // Not shown: Similar to above for Hull Shader, Domain Shader and Compute Shader.

    HRESULT CreateClassLinkage(
        ID3D11ClassLinkage **ppClassLinkage);

    // Shader bind calls take an extra array to specify the function tables
    // to use until the next bind shader call.
    void VSSetShader(
        ID3D11VertexShader *pShader,
        ID3D11ClassInstance **ppClassInstances,
        UINT NumInstances);
    void GSSetShader(
        ID3D11GeometryShader *pShader,
        ID3D11ClassInstance **ppClassInstances,
        UINT NumInstances);
    void PSSetShader(
        ID3D11PixelShader *pShader,
        ID3D11ClassInstance **ppClassInstances,
        UINT NumInstances);
    // Not shown: Similar to above for Hull Shader, Domain Shader and Compute Shader.
}
interface Light
{
    float3 Calculate(float3 Position, float3 Normal);
};

class AmbientLight : Light
{
    float3 m_AmbientValue;
    float3 Calculate(float3 Position, float3 Normal)
    {
        return m_AmbientValue;
    }
};

class DirectionalLight : Light
{
    float3 m_LightDir;
    float3 m_LightColor;
    float3 Calculate(float3 Position, float3 Normal)
    {
        float LightContrib = saturate( dot( Normal, -m_LightDir) );
        return m_LightColor * LightContrib;
    }
};

uint g_NumLights;
uint g_LightsInUse[4];
Light g_Lights[9];

float3 AccumulateLighting(float3 Position, float3 Normal)
{
    float3 Color = 0;
    for (uint i = 0; i < g_NumLights; i++)
    {
        Color += g_Lights[g_LightsInUse[i]].Calculate(Position, Normal);
    }
    return Color;
}

interface Material
{
    void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord);
    float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord);
};

class FlatMaterial : Material
{
    float3 m_Color;
    void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord)
    {
    }
    float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord)
    {
        return m_Color * AccumulateLighting(Position, Normal);
    }
};

class TexturedMaterial : Material
{
    float3 m_Color;
    Texture2D<float3> m_Tex;
    sampler m_Sampler;
    void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord)
    {
    }
    float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord)
    {
        float3 Color = m_Color;
        Color *= m_Tex.Sample(m_Sampler, TexCoord) * 0.1234;
        Color *= AccumulateLighting(Position, Normal);
        return Color;
    }
};

class StrangeMaterial : Material
{
    void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord)
    {
        Position += Normal * 0.1;
    }
    float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord)
    {
        return AccumulateLighting(Position, Normal);
    }
};

float TestValueFromLight(Light Obj, float3 Position, float3 Normal)
{
    float3 Calc = Obj.Calculate(Position, Normal);
    return saturate(Calc.x + Calc.y + Calc.z);
}

AmbientLight g_Ambient0;
DirectionalLight g_DirLight0;
DirectionalLight g_DirLight1;
DirectionalLight g_DirLight2;
DirectionalLight g_DirLight3;
DirectionalLight g_DirLight4;
DirectionalLight g_DirLight5;
DirectionalLight g_DirLight6;
DirectionalLight g_DirLight7;
FlatMaterial g_FlatMat0;
TexturedMaterial g_TexMat0;
StrangeMaterial g_StrangeMat0;

float4 main(
    Material MyMaterial,
    float3 CurPos : CurPosition,
    float3 Normal : Normal,
    float2 TexCoord : TexCoord0) : SV_Target
{
    float4 Ret;
    if (TestValueFromLight(g_DirLight0, CurPos, Normal) > 0.5)
    {
        MyMaterial.Perturb(CurPos, Normal, TexCoord);
    }
    Ret.xyz = MyMaterial.CalculateLitColor(CurPos, Normal, TexCoord);
    Ret.w = 1;
    return Ret;
}
//
// This pointers are a four-element vector with indices for
// which constant buffer holds the instance data (.x element),
// the base offset of the instance data in the instance constant
// buffer, the base texture index and the base sampler index.
// Basic instance members will therefore be referenced with
// cb[r0.x][r0.y + member_offset].
// This pointers can be in arrays so the first [] index
// can also have a register to indicate array access.
//
//
// For this example assume that globals are put in cbuffers
// in the following order. Entries are offset:size in
// register (four-component) units.
//
// cb0:
//   0:1  - g_NumLights.
//   1:4  - g_LightsInUse.
//   5:1  - g_Ambient0.
//   6:2  - g_DirLight0.
//   8:2  - g_DirLight1.
//   10:2 - g_DirLight2.
//   12:2 - g_DirLight3.
//   14:2 - g_DirLight4.
//   16:2 - g_DirLight5.
//   18:2 - g_DirLight6.
//   20:2 - g_DirLight7.
//   22:1 - g_FlatMat0.
//   23:1 - g_TexMat0.
//
// g_StrangeMat0 takes no space.
//
// interfaces:
//   0:1 - MyMaterial.
//   1:9 - g_Lights.
//
// textures:
//   0:1 - g_TexMat0.
//
// samplers:
//   0:1 - g_TexMat0.
//
// The this pointers for the concrete objects would then be:
//   g_Ambient0:    { 0, 5, -, - }
//   g_DirLight0:   { 0, 6, -, - }
//   g_DirLight1:   { 0, 8, -, - }
//   g_DirLight2:   { 0, 10, -, - }
//   g_DirLight3:   { 0, 12, -, - }
//   g_DirLight4:   { 0, 14, -, - }
//   g_DirLight5:   { 0, 16, -, - }
//   g_DirLight6:   { 0, 18, -, - }
//   g_DirLight7:   { 0, 20, -, - }
//   g_FlatMat0:    { 0, 22, -, - }
//   g_TexMat0:     { 0, 23, 0, 0 }
//   g_StrangeMat0: { -, -, -, - }
//
//
// Function bodies are declared explicitly so
// that it’s known in advance which bodies exist
// and how many bodies there are overall.
//
dcl_function_body fb0
dcl_function_body fb1
dcl_function_body fb2
dcl_function_body fb3
dcl_function_body fb4
dcl_function_body fb5
dcl_function_body fb6
dcl_function_body fb7
dcl_function_body fb8
dcl_function_body fb9
dcl_function_body fb10
dcl_function_body fb11

//
// Function tables work similarly to vtables for C++ except
// that a table has an entry per call site for an interface
// instead of per method.
//
// Function table for AmbientLight.
// One call site in AccumulateLighting multiplied by three calls of
// AccumulateLighting from CalculateLitColor.
dcl_function_table ft0 { fb3, fb6, fb9 }
// Function table for DirectionalLight.
// One call site in AccumulateLighting multiplied by three calls of
// AccumulateLighting from CalculateLitColor.
dcl_function_table ft1 { fb4, fb7, fb10 }
// Function table for FlatMaterial.
// One call to Perturb in main and one call to CalculateLitColor in main.
dcl_function_table ft2 { fb0, fb5 }
// Function table for TexturedMaterial.
// One call to Perturb in main and one call to CalculateLitColor in main.
dcl_function_table ft3 { fb1, fb8 }
// Function table for StrangeMaterial.
// One call to Perturb in main and one call to CalculateLitColor in main.
dcl_function_table ft4 { fb2, fb11 }

//
// Function table pointers. Each of these needs to be bound before
// the shader is usable. The idea is that binding gives
// a reference to one of the function tables above so that
// the method slots can be filled in.
// The compiler will not generate pointers for unreferenced objects.
//
// A function table pointer has a full set of method slots to
// avoid the extra level of indirection that a C++ pointer-to-
// pointer-to-vtable representation would require (that would also
// require that this pointers be 5-tuples). In the HLSL virtual
// inlining model it's always known what global variable/input is
// used for a call so we can set up tables per root object.
//
// Function pointer decls indicate which function tables are
// legal to use with them. This also allows derivation of
// method correlation information.
//
// The first [] of an interface decl is the array size.
// If dynamic indexing is used the decl will indicate
// that, as shown below. An array of interface pointers can
// be indexed statically also, it isn’t required that
// arrays of interface pointers mean dynamic indexing.
//
// Numbering of interface pointers takes array size into
// account, so the first pointer after a four entry
// array fp6[4][1] would be fp10.
//
// The second [] of an interface decl is the number
// of call sites, which must match the number of bodies in
// each table referenced in the decl.
//
// main's MyMaterial parameter.
dcl_interface fp0[1][2] = { ft2, ft3, ft4 };
// g_Lights entries.
dcl_interface_dynamicindexed fp1[9][3] = { ft0, ft1 };

// main routine.
// TestValueFromLight is a regular routine and is inlined.
// The Calculate reference inside of it is passed the concrete
// instance DirLight0 so it is devirtualized and inlined.
dp3_sat r0.x, v1.xyzx, -cb0[6].xyzx
mul r0.yz, r0.xxxx, cb0[7].xxyx
add r0.y, r0.z, r0.y
mad_sat r0.x, cb0[7].z, r0.x, r0.y
// The return of TestValueFromLight is tested.
lt r0.x, l(0.500000), r0.x
if_nz r0.x
  // The call to Perturb is a full fcall.
  fcall fp0[0][0]
  mov r2.xyz, r0.xyzx
  mov r0.x, r0.w
  mov r0.y, r1.x
else
  mov r2.xyz, v1.xyzx
  mov r0.xy, v2.xyxx
endif
// The call to CalculateLitColor is a full fcall.
fcall fp0[0][1]
mov o0.xyz, r1.xyzx
mov o0.w, l(1.000000)
ret

//
// Function bodies.
//
// FlatMaterial version of main's call to Perturb.
label fb0
mov r0.xyz, v1.xyzx
mov r0.w, v2.y
mov r1.x, v2.x
ret

// TexturedMaterial version of main's call to Perturb.
label fb1
mov r0.xyz, v1.xyzx
mov r0.w, v2.x
mov r1.x, v2.y
ret

// StrangeMaterial version of main's call to Perturb.
// NOTE: Position is not used later so the compiler has killed
// the update to Position from this body.
label fb2
mov r0.xyz, v1.xyzx
mov r0.w, v2.x
mov r1.x, v2.y
ret

// AmbientLight version of FlatMaterial.CalculateLitColor-calls-
// AccumulateLighting's call to Calculate.
// NOTE: the Calculate bodies all look superficially
// identical but all are different. In one case
// the array index is r1 and the return value is r4,
// in one case the array index is r1 and the return value
// is r5 and in the last case the array index is in r0
// and the return is in r5. Bodies are not interchangeable.
label fb3
// Array index is r1, return is r4.
mov r2.w, this[r1.w + 1].y
mov r1.w, this[r1.w + 1].x
mov r4.xyz, cb[r1.w + 0][r2.w + 0].xyzx
ret

// DirectionalLight version of FlatMaterial.CalculateLitColor-calls-
// AccumulateLighting's call to Calculate.
label fb4
// Array index is r1, return is r4.
mov r2.w, this[r1.w + 1].y
mov r3.w, this[r1.w + 1].x
mov r4.w, this[r1.w + 1].y
mov r5.x, this[r1.w + 1].x
dp3_sat r4.w, r2.xyzx, -cb[r5.x + 0][r4.w + 0].xyzx
mul r5.xyz, r4.wwww, cb[r3.w + 0][r2.w + 1].xyzx
mov r4.xyz, r5.xyzx
ret

// FlatMaterial version of main's call to CalculateLitColor.
label fb5
// AccumulateLighting is inlined.
mov r3.xyz, l(0,0,0,0)
mov r0.w, l(0)
loop
  // g_NumLights is cb0[0].
  uge r1.w, r0.w, cb0[0].x
  breakc_nz r1.w
  // Get g_Lights[g_LightsInUse[i]].
  // g_LightsInUse is cb0[1-4].
  // g_Lights is cb0[5-13].
  mov r1.w, cb0[r0.w + 1].x
  // Call Calculate. Array index is r1.
  fcall fp1[r1.w + 0][0]
  // Return is expected in r4.
  mov r0.xyz, r4.xyzx
  add r3.xyz, r3.xyzx, r0.xyzx
  iadd r0.w, r0.w, l(1)
endloop
// Multiply times color.
mov r0.xy, this[0].yxyy
mul r0.xyz, r3.xyzx, cb[r0.y + 0][r0.x + 0].xyzx
mov r1.xyz, r0.xyzx
ret

// AmbientLight version of TexturedMaterial.CalculateLitColor-calls-
// AccumulateLighting's call to Calculate.
label fb6
// Array index is r1, return is r5.
mov r2.w, this[r1.w + 1].y
mov r1.w, this[r1.w + 1].x
mov r5.xyz, cb[r1.w + 0][r2.w + 0].xyzx
ret

// DirectionalLight version of TexturedMaterial.CalculateLitColor-calls-
// AccumulateLighting's call to Calculate.
label fb7
// Array index is r1, return is r5.
mov r2.w, this[r1.w + 1].y
mov r3.w, this[r1.w + 1].x
mov r4.w, this[r1.w + 1].y
mov r5.w, this[r1.w + 1].x
dp3_sat r4.w, r2.xyzx, -cb[r5.w + 0][r4.w + 0].xyzx
mul r6.xyz, r4.wwww, cb[r3.w + 0][r2.w + 1].xyzx
mov r5.xyz, r6.xyzx
ret

// TexturedMaterial version of main's call to CalculateLitColor.
label fb8
// Texture sample.
mov r4.xy, this[0].zw
sample r0.xyz, v2.xy, t[r4.x].xyz, s[r4.y]
mul r0.xyz, r0.xyzx, l(0.123400, 0.123400, 0.123400, 0.000000)
// m_Color multiplied by texture sample.
mov r0.w, this[0].y
mov r1.w, this[0].x
mul r0.xyz, r0.xyzx, cb[r1.w + 0][r0.w + 0].xyzx
// AccumulateLighting is inlined.
mov r4.xyz, l(0,0,0,0)
mov r0.w, l(0)
loop
  // g_NumLights is cb0[0].
  uge r1.w, r0.w, cb0[0].x
  breakc_nz r1.w
  // Get g_Lights[g_LightsInUse[i]].
  // g_LightsInUse is cb0[1-4].
  // g_Lights is cb0[5-13].
  mov r1.w, cb0[r0.w + 1].x
  // Call Calculate. Array index is in r1.
  fcall fp1[r1.w + 0][1]
  // Return is expected in r5.
  mov r3.xyz, r5.xyzx
  add r4.xyz, r4.xyzx, r3.xyzx
  iadd r0.w, r0.w, l(1)
endloop
// Multiply accumulated color times texture color.
mul r0.xyz, r0.xyzx, r4.xyzx
mov r1.xyz, r0.xyzx
ret

// AmbientLight version of StrangeMaterial.CalculateLitColor-calls-
// AccumulateLighting's call to Calculate.
label fb9
// Array index is r0, return is r5.
mov r1.w, this[r0.w + 1].y
mov r0.w, this[r0.w + 1].x
mov r5.xyz, cb[r0.w + 0][r1.w + 0].xyzx
ret

// DirectionalLight version of StrangeMaterial.CalculateLitColor-calls-
// AccumulateLighting's call to Calculate.
label fb10
// Array index is r0, return is r5.
mov r1.w, this[r0.w + 1].y
mov r2.w, this[r0.w + 1].x
mov r3.w, this[r0.w + 1].y
mov r4.w, this[r0.w + 1].x
dp3_sat r3.w, r2.xyzx, -cb[r4.w + 0][r3.w + 0].xyzx
mul r6.xyz, r3.wwww, cb[r2.w + 0][r1.w + 1].xyzx
mov r5.xyz, r6.xyzx
ret

// StrangeMaterial version of main's call to CalculateLitColor.
label fb11
// AccumulateLighting is inlined.
mov r4.xyz, l(0,0,0,0)
mov r0.z, l(0)
loop
  // g_NumLights is cb0[0].x.
  uge r0.w, r0.z, cb0[0].x
  breakc_nz r0.w
  // Get g_Lights[g_LightsInUse[i]].
  // g_LightsInUse is cb0[1-4].
  // g_Lights is cb0[5-13].
  mov r0.w, cb0[r0.z + 1].x
  // Call Calculate. Array index is in r0.
  fcall fp1[r0.w + 0][2]
  // Return is in r5.
  mov r3.xyz, r5.xyzx
  add r4.xyz, r4.xyzx, r3.xyzx
  iadd r0.z, r0.z, l(1)
endloop
mov r1.xyz, r4.xyzx
ret
// Create a class linkage to hold class instance data.
pDevice->CreateClassLinkage(&pMyClassLinkage);

// Create the shader and supply the class linkage to add class instance data to.
pDevice->CreatePixelShader(pMyCompiledPixelShader, pMyClassLinkage, &pMyPS);

// Use reflection to find where data should be stored in the interface array.
NumInterfaces = pMyPSReflection->GetNumInterfaces();
pMyLightsVar = pMyPSReflection->GetVariableByName("g_Lights");
iLightOffset = pMyLightsVar->GetInterfaceSlot(0);
pMyMaterialVar = pMyPSReflection->GetVariableByName("$MyMaterial");
iMatOffset = pMyMaterialVar->GetInterfaceSlot(0);

// Use the class linkage to get references to all class instances
// needed in the shader.
pMyClassLinkage->GetClassInstance(L"g_Ambient0", 0, &pAmbient0);
pMyClassLinkage->GetClassInstance(L"g_DirLight0", 0, &pDirLight[0]);
pMyClassLinkage->GetClassInstance(L"g_DirLight1", 0, &pDirLight[1]);
pMyClassLinkage->GetClassInstance(L"g_DirLight2", 0, &pDirLight[2]);
pMyClassLinkage->GetClassInstance(L"g_DirLight3", 0, &pDirLight[3]);
pMyClassLinkage->GetClassInstance(L"g_DirLight4", 0, &pDirLight[4]);
pMyClassLinkage->GetClassInstance(L"g_DirLight5", 0, &pDirLight[5]);
pMyClassLinkage->GetClassInstance(L"g_DirLight6", 0, &pDirLight[6]);
pMyClassLinkage->GetClassInstance(L"g_DirLight7", 0, &pDirLight[7]);
pMyClassLinkage->GetClassInstance(L"g_FlatMat0", 0, &pFlatMat0);
pMyClassLinkage->GetClassInstance(L"g_TexMat0", 0, &pTexMat0);
pMyClassLinkage->GetClassInstance(L"g_StrangeMat0", 0, &pStrangeMat0);

// Set lights in the array - they do not change, only indices to them do.
pMyInterfaceArray[iLightOffset] = pAmbient0;
for (uint i = 0; i < 8; i++)
{
    pMyInterfaceArray[iLightOffset + i + 1] = pDirLight[i];
}

while (true)
{
    if (bFlatSunlightOnly)
    {
        // Set g_NumLights to 1 in the constant buffer.
        // Set g_LightsInUse[0] to 1 in the constant buffer.
        pMyInterfaceArray[iMatOffset] = pFlatMat0;
    }
    else if (bStrangeMaterials)
    {
        // Set g_NumLights and fill out g_LightsInUse.
        pMyInterfaceArray[iMatOffset] = pStrangeMat0;
    }
    else
    {
        // Set g_NumLights and fill out g_LightsInUse.
        pMyInterfaceArray[iMatOffset] = pTexMat0;
    }

    // Set the pixel shader and the interfaces to use until the next bind call.
    pDevice->PSSetShader(pMyPS, pMyInterfaceArray, NumInterfaces);

    // Use the shader that was just bound to draw something.
    RenderScene();
}
Section Contents
(back to chapter)
7.20.1 Overview
This adds support for 10-bit (2.8 fixed point) and 16-bit precision float, and in some cases limited integer arithmetic, to shader model 2.0+.
Shader<->memory I/O operations are unchanged for simplicity, e.g. shader constants continue to be defined as 32-bit per component.
Implementations are allowed to execute low precision operations at higher precision. So 10-bit arithmetic could be done at 10-bits or more (say 32-bit) precision.
The new 10 and 16 bit precision levels for shaders are inspired by their existence in some real hardware and their presence in OpenGL ES. (8 bit was considered but cut due to its limitations versus the value it seemed to provide at the time).
| | Default Precision | Min 10-bit fixed point (2.8) | Min 16-bit int / float | 32-bit int/float | 64-bit float |
|---|---|---|---|---|---|
| Executing at higher precision allowed? | - | Y | Y | N | N |
| Shader Constants | - | N | N | Y | Y |
| SM 2.x | VS: fp32 / int23; PS: fp24 (s16e7) / int16 | opt | opt | N | N |
| SM 3.0 | fp32 | N | N | Y | N |
| SM 4.x | fp32 / int32 | opt | opt | Y | opt |
| SM 5.0 | fp32 / int32 | opt | opt | Y | opt |
| Float range | - | [-2, 2) | [-2^14, 2^14] | Full IEEE 754 | Full IEEE 754 |
| Float magnitude range | - | 2^-8 ... 2 | On SM 4+, includes INF/NaN | Full IEEE 754 | Full IEEE 754 |
| Int range | - | - | (-2^11, 2^11); full range signed and unsigned on SM4+ | full | - |
This is a 2.8 fixed point value, though the fixed point semantics may not be identical to the general fixed point semantics defined in the D3D10+ specs. Following the D3D10+ fixed point semantics is recommended for future hardware that may choose to implement the 10-bit precision level.
8-bit UNORM data is invertible when passed through 10-bit min-precision storage. For example: Suppose UNORM 8-bit data that is point sampled from the texture format DXGI_FORMAT_R8G8B8A8_UNORM gets read into a shader and is stored and passed around in the 10-bit representation. If that data is subsequently written unchanged out to a UNORM 8-bit output (such as a DXGI_FORMAT_R8G8B8A8_UNORM rendertarget), the output UNORM value matches the input UNORM value. This guarantee does not (cannot) apply for other formats passing through 10-bit, such as 8-bit UNORM_SRGB or higher precision UNORM values like 16-bit UNORM.
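A quick numeric sketch of why the guarantee holds (illustrative C, not normative): UNORM8 values are n/255, while 2.8 fixed point quantizes in steps of 1/256; the worst-case quantization error (1/512) is less than half the spacing of UNORM8 values (1/510), so rounding back to UNORM8 always recovers n:

#include <assert.h>

// Round-trip check: UNORM8 -> 2.8 fixed point -> UNORM8.
for (int n = 0; n <= 255; n++)
{
    float v = n / 255.0f;                     // UNORM8 value
    int fixed28 = (int)(v * 256.0f + 0.5f);   // nearest 1/256 step
    float stored = fixed28 / 256.0f;          // error <= 1/512
    int back = (int)(stored * 255.0f + 0.5f); // nearest UNORM8
    assert(back == n);                        // always recovers n
}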
From the shader point of view, data at the 10-bit min-precision level appears as a float value with at minimum [-2,2) range.
Hardware that supports 10-bit precision must also support 16-bit precision.
For float values, this is float16 as defined in the D3D10+ specs. The exception is that for Shader Model 2, the max exponent encodings (normally defining NaN/INF) are unused (undefined).
Conversion from float32 (e.g. from shader constants) to float16 may or may not flush float16 denorm to 0, and round to zero is used, per D3D spec for high to low precision float. Float16 arithmetic operations within the shader may or may not flush float16 denorm to 0, and may either round to nearest even or truncate to a representable number. Out of range values in conversion from float32 or arithmetic may produce +/-MAX_FLOAT16 or +/- INF.
16-bit integer min-precision is available as well in HLSL. For Shader Models 2, this is constrained to be representable as integral floats (1.0f, 2.0f, etc.) in a float16 encoding. In the shader bytecode these appear simply as float16, so native integer operations are not available. (it may not be worth bothering to expose this constrained form of int16 for SM 2/3)
For shader model 4+, native integer ops can be used on 16-bit min-precision values, however applications must beware that the device could choose to simply use larger-than-16-bit (e.g. 32 bit) integer ops without any clamping to maintain the illusion that there are not more than 16 bits present.
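For instance (a sketch using min16uint, the HLSL-level spelling of this min-precision): arithmetic that would wrap at 16 bits on true 16-bit hardware may behave differently on hardware that runs the operation at 32 bits, so applications must not rely on 16-bit wraparound:

min16uint a = 0xFFFF;  // max 16-bit value
min16uint sum = a + 1; // true 16-bit hardware: wraps to 0;
                       // wider hardware: downstream code may observe 0x10000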
Shader Constants feeding 16-bit shader arithmetic are always fp32 encoded for Shader Model 2. For Shader Models 4+, Shader Constants feeding 16-bit in the shader are specified as float32 or UINT32/INT32 as appropriate (i.e. unchanged from the way constants feed into float32 arithmetic).
A new MIN_PRECISION enum is added to the source and dest parameter token, definition below. This specifies the minimum precision level for the entire operation – implementations can use equal or greater precision. This new enum co-exists with the PARTIALPRECISION flag that is already in the same dest parameter token – see the comment below.
// Source or dest token bits [15:14]:
#define D3D11_SB_OPERAND_MIN_PRECISION_MASK  0x0001C000
#define D3D11_SB_OPERAND_MIN_PRECISION_SHIFT 14

typedef enum _D3DSHADER_MIN_PRECISION
{
    D3DMP_DEFAULT = 0, // Default precision for the shader model
    D3DMP_16 = 1,      // Min 16 bit per component
    D3DMP_2_8 = 2,     // Min 10 bits (2.8) per component
} D3DSHADER_MIN_PRECISION;

// When MIN_PRECISION is nonzero on a dest token, the dest modifier
// D3DSPDM_PARTIALPRECISION must also be set for consistency.
//
// If D3DSPDM_PARTIALPRECISION is set but
// D3DSHADER_MIN_PRECISION is D3DMP_DEFAULT(0),
// it is equivalent to D3DSPDM_PARTIALPRECISION + D3DMP_16
// (PARTIALPRECISION existed before MIN_PRECISION was
// added, so this defines how the two can coexist without changing
// meaning for old shaders).
The src/dest token for instructions in PS/VS 2.x can use the MIN_PRECISION enum in the following circumstances:
A new MIN_PRECISION enum is added to the source and dest operand tokens, definition below. This specifies the minimum precision level for the entire operation – implementations can use equal or greater precision.
The encoding distinguishes type (e.g. float vs. sint vs. uint), in addition to precision level, to disambiguate instructions like “mov” that don’t already imply a type. This makes a difference when there is a size change involved in the instruction. E.g. moving a 32 bit float to a min. 16 bit float is a different task for hardware than moving a 32 bit uint to a min. 16 bit uint. This type distinction is not needed for the D3D9 shader bytecode because all arithmetic is “float” there.
// Min precision specifier for source/dest operands. This
// fits in the extended operand token field. Implementations are free to
// execute at higher precision than the min – details spec’d elsewhere.
// This is part of the opcode specific control range.
typedef enum D3D11_SB_OPERAND_MIN_PRECISION
{
    D3D11_SB_OPERAND_MIN_PRECISION_DEFAULT   = 0, // Default precision
                                                  // for the shader model
    D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_16  = 1, // Min 16 bit/component float
    D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_2_8 = 2, // Min 10(2.8)bit/comp. float
    D3D11_SB_OPERAND_MIN_PRECISION_SINT_16   = 4, // Min 16 bit/comp. signed integer
    D3D11_SB_OPERAND_MIN_PRECISION_UINT_16   = 5, // Min 16 bit/comp. unsigned integer
} D3D11_SB_OPERAND_MIN_PRECISION;

#define D3D11_SB_OPERAND_MIN_PRECISION_MASK  0x0001C000
#define D3D11_SB_OPERAND_MIN_PRECISION_SHIFT 14

// DECODER MACRO: For an OperandToken1 that can specify
// a minimum precision for execution, find out what it is.
#define DECODE_D3D11_SB_OPERAND_MIN_PRECISION(OperandToken1) \
    ((D3D11_SB_OPERAND_MIN_PRECISION)(((OperandToken1)& \
    D3D11_SB_OPERAND_MIN_PRECISION_MASK)>> \
    D3D11_SB_OPERAND_MIN_PRECISION_SHIFT))

// ENCODER MACRO: Encode minimum precision for execution
// into the extended operand token, OperandToken1
#define ENCODE_D3D11_SB_OPERAND_MIN_PRECISION(MinPrecision) \
    (((MinPrecision)<< D3D11_SB_OPERAND_MIN_PRECISION_SHIFT)& \
    D3D11_SB_OPERAND_MIN_PRECISION_MASK)

// ----------------------------------------------------------------------------
// Global Flags Declaration
//
// OpcodeToken0:
// ... snip ...
// [16:16] Enable minimum-precision data types
// ... snip ...
//
// OpcodeToken0 is followed by no operands.
//
// ----------------------------------------------------------------------------
... snip ...
#define D3D11_1_SB_GLOBAL_FLAG_ENABLE_MINIMUM_PRECISION (1<<16)
... snip ...
// DECODER MACRO: Get global flags
#define DECODE_D3D10_SB_GLOBAL_FLAGS(OpcodeToken0) ((OpcodeToken0)&D3D10_SB_GLOBAL_FLAGS_MASK)
// ENCODER MACRO: Encode global flags
#define ENCODE_D3D10_SB_GLOBAL_FLAGS(Flags) ((Flags)&D3D10_SB_GLOBAL_FLAGS_MASK)
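For instance, a bytecode generator might tag an extended operand token as min-16-bit float like this (an illustrative usage of the macros above):

// Mark an operand as executable at min 16-bit float precision.
UINT OperandToken1 = 0; // extended operand token being built
OperandToken1 |= ENCODE_D3D11_SB_OPERAND_MIN_PRECISION(
    D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_16);

// Later, a decoder recovers the precision specifier:
D3D11_SB_OPERAND_MIN_PRECISION MinPrec =
    DECODE_D3D11_SB_OPERAND_MIN_PRECISION(OperandToken1); // FLOAT_16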
The dest and source operand tokens in SM 4.0+ can use the MIN_PRECISION enum in the following circumstances:
Shader constants are defined at full 32-bit per component. New hardware implementing low precision is encouraged to design efficient downconversion support upon constant access; otherwise the driver will need to do some work or add extra conversion instructions into shaders that read 32-bit per component constants into lower precision shader operations.
Alternative approaches were considered where low precision constants are exposed all the way to the application (freeing driver/hardware from having to convert constants), but the added complexity in the programming model vs the benefit didn’t hold up at least at this time.
When referencing a shader constant from a low precision instruction, if the constant value is out of the range of the instruction’s precision level, the value read is undefined. For constant values within range of a low precision instruction reference, the precision of the value may still get quantized down from full 32 bits.
Shader constants referenced in shader source operands will be marked at the precision they are to be referenced at, even though they come down the API/DDI at 32-bit per component.
Low precision data is referenced by component in masks and swizzles – xyzw - just like default precision data. It is as though the registers do have a smaller number of bits (for hardware that supports lower precision). This is unlike the way double precision is mapped, where xy contains one double and zw contains another. Low precision doesn’t yield sub-fields within .x for example.
The HLSL compiler will not generate code that mixes precisions in different components of any xyzw register (mostly for simplicity, even though this may not matter for hardware).
The use of min / low precision specifiers never increases the maximum amount of resources available to a shader (such as limits on inputs, outputs or temp storage), since the shader must always be able to function on hardware that does not operate at low precision.
In the D3D system, HLSL shaders are compiled independent of any given device – e.g. they should typically be compiled offline. This compilation step produces device-agnostic bytecode, apart from the choice of shader target, e.g. vs_4_0.
The minimum precision facility described above can be optionally used within any 4_0+ shader, including 4_0_level_9_1 to 4_0_level_9_3. These shader targets are all available through the D3D11 runtime, exposing D3D9+ hardware via Shader Model 2_x+. The D3D9 runtime will not expose the low precision modes – updating that runtime is out of scope.
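At the HLSL source level, minimum precision is expressed via min-precision scalar types (e.g. min16float, min10float, min16int, min16uint), which the compiler maps to the bytecode markings described above. A small sketch, usable with any 4_0+ target:

// Sketch: min-precision HLSL types compile down to the operand markings above.
min10float2 g_BlendWeights; // min 10-bit (2.8 fixed point) precision

min16float4 Shade(min16float4 Color, min16float Scale)
{
    // May legally execute at exactly 16 bits, 32 bits, or anything in
    // between, depending on what the hardware supports.
    return Color * Scale;
}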
There is a mechanism at the API to discover the precision levels supported by the current device. Note that in Windows 8 the OS did not allow drivers to expose only 10 bit without also exposing 16 bit, but subsequent operating systems relax that requirement (so an implementation may expose 10 bit min precision but not 16 bit min precision).
Even though the hardware’s precision support is visible to applications, applications do not have to adjust their shaders for the hardware’s precision level given that by definition operations defined with a min precision run at higher precision on hardware that doesn’t support the min precision.
It is fine for hardware to not support low precision processing at all – by simply reporting “DEFAULT” as its precision support. The reason it is called “DEFAULT” rather than some numerical precision is that, depending on the shader model, there may not be a standard value to express. E.g. the default precision in SM 2.x is fp24 (or greater) within the shader, even though there is no API visible fp24 format. If the device reports “DEFAULT” precision, all min-precision specifiers in shaders are ignored.
D3D9 devices are permitted to report a min-precision level that is lower for the Pixel Shader than for the Vertex Shader (all reported via the Windows Next D3D9 DDI). D3D10+ devices can only report a single min-precision level that applies to all shader stages (reported via the Windows Next D3D11.1 DDI) – since it does not seem to make sense to single out the VS any more. Note that if the application uses Feature Level 9_x on D3D10+ hardware, the D3D9 DDIs are still used, so the min-precision levels can be reported differently there between VS and PS, as mentioned for D3D9, even though via the D3D11.1 DDI only a single precision can be reported.
Regardless of the min precision level supported by a given device, it is always valid to use a shader that was compiled using any combination of the low precision levels on it. For example if a device’s min precision level is 32-bit, it is fine to use a shader compiled with some variables that have a min precision of 10 bit. The device is free to implement the low precision operations at any equal or higher precision level (including precision levels not available at the API).
For old drivers (pre-D3D11.1 DDI) that are not aware of the low precision feature, the D3D runtime will patch the shader bytecode on shader creation to remove it. This preserves the intent of the shader, since it is valid for the device to execute operations tagged with a min precision level at a higher precision.
An API is added for reporting device precision support; no other D3D11 API surface area changes apply.
As far as other DDI additions, there is device precision reporting, the shader bytecode additions detailed earlier, and finally a variant of the existing shader stage I/O signature DDI:
The I/O signature DDI includes MinPrecision in the signature entry. This shows up as D3D11_SB_INSTRUCTION_MIN_PRECISION_DEFAULT if the shader didn’t specify a min-precision:
typedef struct D3D11_1DDIARG_SIGNATURE_ENTRY
{
    D3D10_SB_NAME SystemValue; // D3D10_SB_NAME_UNDEFINED if the particular
                               // entry doesn't have a system name.
    UINT Register;
    BYTE Mask; // (D3D10_SB_OPERAND_4_COMPONENT_MASK >> 4),
               // meaning 4 LSBs are xyzw respectively
    D3D11_SB_INSTRUCTION_MIN_PRECISION MinPrecision;
} D3D11_1DDIARG_SIGNATURE_ENTRY;

typedef struct D3D11_1DDIARG_STAGE_IO_SIGNATURES
{
    D3D11_1DDIARG_SIGNATURE_ENTRY* pInputSignature;
    UINT NumInputSignatureEntries;
    D3D11_1DDIARG_SIGNATURE_ENTRY* pOutputSignature;
    UINT NumOutputSignatureEntries;
} D3D11_1DDIARG_STAGE_IO_SIGNATURES;
Motivation: Recall that this DDI exists to complement the shader creation DDIs by providing a more complete picture of the shader stage<->stage I/O layout than may be visible just from an individual shader’s bytecode. For example sometimes an upstream stage provides data not consumed by a downstream shader, but it should be possible for a driver to compile a shader on its own without having to wait and see what other shaders it gets used with. MinPrecision is added in case that affects how the driver shader compiler would want to pack the inter-stage I/O data.
Out of scope for this spec.
Chapter Contents
(back to top)
8.1 IA State
8.2 Drawing Commands
8.3 Draw()
8.4 DrawInstanced()
8.5 DrawIndexed()
8.6 DrawIndexedInstanced()
8.7 DrawInstancedIndirect()
8.8 DrawIndexedInstancedIndirect()
8.9 DrawAuto()
8.10 Primitive Topologies
8.11 Patch Topologies
8.12 Generating Multiple Strips
8.13 Partially Completed Primitives
8.14 Leading Vertex
8.15 Adjacency
8.16 VertexID
8.17 PrimitiveID
8.18 InstanceID
8.19 Misc. IA Issues
8.20 Input Assembler Data Conversion During Fetching
8.21 IA Example
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
An overview of the IA is at the beginning(2.1) of the document. This section provides implementation details more like they are viewed from the DDI perspective (exact parameter names may not match). The API view is different, in that instead of hardcoding shader register numbers in the state declaration, names are used, and when creating Input Assembler State objects, the runtime figures out which registers the names correspond to based on a shader input signature definition.
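For illustration, a sketch of the API-side view (this uses the released API's structure, which shares its name with the DDI-style structure shown later in this section but identifies shader inputs by semantic name rather than register number):

// API view (sketch): the runtime matches SemanticName/SemanticIndex against
// the shader's input signature to find the corresponding input registers.
D3D11_INPUT_ELEMENT_DESC Elements[] =
{
    // SemanticName, SemanticIndex, Format, InputSlot,
    // AlignedByteOffset, InputSlotClass, InstanceDataStepRate
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0,
      D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 0, 12,
      D3D11_INPUT_PER_VERTEX_DATA, 0 },
};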
An illustrated example of the IA being used is at the end(8.21) of this section.
Section Contents
(back to chapter)
8.1.1 Overview
8.1.2 Primitive Topology Selection
8.1.3 Input Layout
8.1.4 Resource Bindings
The states defining the Input Assembler's operation are described here. Draw*() commands on the Device, described below(8.2), use the currently active IA state to define most of their behavior.
The following enumeration lists the various Primitive Topologies(8.10) available to the IA.
typedef enum D3D11_PRIMITIVE_TOPOLOGY
{
    D3D11_PRIMITIVE_TOPOLOGY_ILLEGAL = 0, // Cannot use this value.
    D3D11_PRIMITIVE_TOPOLOGY_POINTLIST = 1,
    D3D11_PRIMITIVE_TOPOLOGY_LINELIST = 2,
    D3D11_PRIMITIVE_TOPOLOGY_LINESTRIP = 3,
    D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST = 4,
    D3D11_PRIMITIVE_TOPOLOGY_TRIANGLESTRIP = 5,
    // 6 is reserved (legacy triangle fan)
    // 7, 8 and 9 are also reserved
    D3D11_PRIMITIVE_TOPOLOGY_LINELIST_ADJ = 10,      // start _ADJ at 10,
    D3D11_PRIMITIVE_TOPOLOGY_LINESTRIP_ADJ = 11,     // so bit 3 can encode adjacency
    D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST_ADJ = 12,
    D3D11_PRIMITIVE_TOPOLOGY_TRIANGLESTRIP_ADJ = 13,
    D3D11_PRIMITIVE_TOPOLOGY_1_CONTROL_POINT_PATCHLIST = 17,
    D3D11_PRIMITIVE_TOPOLOGY_2_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_3_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_4_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_5_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_6_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_7_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_8_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_9_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_10_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_11_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_12_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_13_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_14_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_15_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_16_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_17_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_18_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_19_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_20_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_21_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_22_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_23_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_24_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_25_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_26_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_27_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_28_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_29_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_30_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_31_CONTROL_POINT_PATCHLIST,
    D3D11_PRIMITIVE_TOPOLOGY_32_CONTROL_POINT_PATCHLIST
} D3D11_PRIMITIVE_TOPOLOGY;
The current primitive topology for the IA is defined by the following method:
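(The signature below is reconstructed here for completeness, following the same DDI-style naming as the other IA methods in this section.)

IASetPrimitiveTopology(
    D3D11_PRIMITIVE_TOPOLOGY Topology ); // topology used by subsequent Draw*() calls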
The following enumerations are used to build declarations of 1D Buffer structure layout. Structure fields are defined with format and offset, plus a target register. Multiple elements (from one or more structures) cannot feed a single register.
typedef enum D3D11_INPUT_CLASSIFICATION
{
    D3D11_INPUT_PER_VERTEX_DATA = 0,
    D3D11_INPUT_PER_INSTANCE_DATA = 1
} D3D11_INPUT_CLASSIFICATION;

typedef struct D3D11_INPUT_ELEMENT_DESC
{
    UINT InputSlot;
    UINT ByteOffset;
    DXGI_FORMAT Format;
    D3D11_INPUT_CLASSIFICATION InputSlotClass; // must be same for all Elements
                                               // at same InputSlot
    UINT InstanceDataStepRate; // InstanceDataStepRate is how many
                               // Instances to draw before stepping one
                               // unit forward in a VertexBuffer containing
                               // Instance Data.
                               // InstanceDataStepRate must be 0 and is
                               // not used when InputSlotClass ==
                               // D3D11_INPUT_PER_VERTEX_DATA.
                               // But when Class == D3D11_INPUT_PER_INSTANCE_DATA,
                               // InstanceDataStepRate can be any value, including 0.
                               // 0 takes special meaning, that the instance data
                               // should never be stepped at all.
                               // This must be the same for all Elements at same InputSlot.
    UINT InputRegister; // Which register in the set of
                        // inputs to the first active Pipeline
                        // stage this Element is going to.
} D3D11_INPUT_ELEMENT_DESC;
The following command creates an input layout.
CreateInputLayout(
    const D3D11_INPUT_ELEMENT_DESC* pDeclaration,
    SIZE_T NumElements,
    ID3D11InputLayout **ppInputLayout);
The following methods bind input vertex buffer(s) to the IA. A set of up to 32 Buffers can be bound at once. The layout of vertex or instance data in all of the Buffers is defined by an Input Layout object. There is also a method for binding an Index Buffer to the IA (having a single Element format describing its data layout).
IASetVertexBuffers(
    UINT StartSlot,  // first Slot for which a Buffer is being bound
    UINT NumBuffers, // number of slots having Buffers bound
    ID3D11Buffer *const *ppVertexBuffers,
    const UINT *pStrides,
    const UINT *pOffsets );

IASetInputLayout(
    ID3D11InputLayout* pInputLayout );

IASetIndexBuffer(
    ID3D11Buffer* pBuffer,
    DXGI_FORMAT Format,
    UINT Offset );
The following rendering commands on a device, Draw()(8.3), DrawInstanced()(8.4), DrawIndexed()(8.5), DrawIndexedInstanced()(8.6), DrawInstancedIndirect()(8.7), and DrawIndexedInstancedIndirect()(8.8) introduce primitives into the D3D11.3 Pipeline.
Draw( UINT VertexCount, UINT StartVertexLocation)
UINT VertexCount | How many vertices to read sequentially from the Vertex Buffer(s) |
UINT StartVertexLocation | Which Vertex to start at in each Vertex Buffer. |
See the pseudocode for DrawInstanced(), below. Draw() behaves the same as DrawInstanced(), with InstanceCount = 1 and StartInstanceLocation = 0. If "Instance" data has been bound, it will be used. But the intent is for this method to be used without instancing.
DrawInstanced( UINT VertexCountPerInstance, UINT InstanceCount, UINT StartVertexLocation, UINT StartInstanceLocation)
UINT VertexCountPerInstance | How many vertices to read sequentially from Buffer(s) marked as Vertex Data (same set repeated for each Instance). |
UINT InstanceCount | How many Instances to render. |
UINT StartVertexLocation | Which Vertex to start at in each Buffer marked as Vertex Data (for each Instance). |
UINT StartInstanceLocation | Which Instance to start sequentially fetching from in each Buffer marked as Instance Data. |
UINT VertexBufferElementAddressInBytes[32][32];
        // [D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT]
        // [D3D11_IA_VERTEX_INPUT_STRUCTURE_ELEMENT_COUNT]
UINT InstanceDataStepCounter[32]; // [D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT]

// Initialize starting Vertex Buffer addresses
for(each slot, s, with a VertexBuffer assigned)
{
    if(Slot[s].Class == D3D11_INPUT_PER_VERTEX_DATA)
    {
        for(each Element, e, in the Buffer's Input Layout)
        {
            VertexBufferElementAddressInBytes[s][e] =
                Slot[s].VertexBufferOffsetInBytes +
                Slot[s].StrideInBytes*StartVertexLocation +
                Slot[s].pInputLayout->pElement[e].OffsetInBytes;
        } // Element loop
    }
    else // (Slot[s].Class == D3D11_INPUT_PER_INSTANCE_DATA)
    {
        for(each Element, e, in the Buffer's Input Layout)
        {
            VertexBufferElementAddressInBytes[s][e] =
                Slot[s].VertexBufferOffsetInBytes +
                Slot[s].StrideInBytes*StartInstanceLocation +
                Slot[s].pInputLayout->pElement[e].OffsetInBytes;
        } // Element loop
        InstanceDataStepCounter[s] = Slot[s].InstanceDataStepRate;
    }
} // slot loop

// Now compute addresses and fetch data
// for all elements of each buffer for each vertex
// for each instance.
for(UINT InstanceID = 0; InstanceID < InstanceCount; InstanceID++)
{
    for(UINT VertexID = 0; VertexID < VertexCountPerInstance; VertexID++)
    {
        for(each slot, s, with a VertexBuffer assigned)
        {
            for(each Element, e, in the buffer's Input Layout)
            {
                // Fetch this vertex Element's data from Slot[s].pBuffer
                // at address VertexBufferElementAddressInBytes[s][e],
                // with type Slot[s].pInputLayout->pElement[e].Format,
                // and output to the Shader Register identified by
                // Slot[s].pInputLayout->pElement[e].Register,
                // taking into account the writemask declared in the shader.
                FetchDataFromMemory(VertexBufferElementAddressInBytes[s][e],s,e);
                if(Slot[s].Class == D3D11_INPUT_PER_VERTEX_DATA)
                {
                    // Increment the address for the next access
                    VertexBufferElementAddressInBytes[s][e] += Slot[s].StrideInBytes;
                }
            } // Element loop
        } // slot loop
    } // vertex loop

    // Patch Instance and Vertex Data addresses at the end of an instance.
    for(each slot, s, with a VertexBuffer assigned)
    {
        if(Slot[s].Class == D3D11_INPUT_PER_VERTEX_DATA)
        {
            for(each Element, e, in the buffer's structure declaration)
            {
                VertexBufferElementAddressInBytes[s][e] =
                    Slot[s].VertexBufferOffsetInBytes +
                    Slot[s].StrideInBytes*StartVertexLocation +
                    Slot[s].pInputLayout->pElement[e].OffsetInBytes;
            } // Element loop
        }
        else //(Slot[s].Class == D3D11_INPUT_PER_INSTANCE_DATA)
        {
            if(1 == InstanceDataStepCounter[s])
            {
                for(each Element, e, in the buffer's structure declaration)
                {
                    VertexBufferElementAddressInBytes[s][e] += Slot[s].StrideInBytes;
                }
                InstanceDataStepCounter[s] = Slot[s].InstanceDataStepRate;
            }
            else if(1 < InstanceDataStepCounter[s])
            {
                InstanceDataStepCounter[s]--;
            }
        }
    } // slot loop
    RestartTopology(); // restart at the end of an instance
} // instance loop
// The following pseudocode for calculating IDs has been separated out from the
// address calculation pseudocode above, for clarity. In practice the
// algorithms would be merged, or possibly be implemented as part of the
// primitive assembly process. Note that VertexID/PrimitiveID/InstanceID
// values are unrelated to address calculations for IA data fetching.
// If desired, applications can choose ID starting values so that IDs can be
// used in Shaders to load data from memory out of similar locations in memory
// as the IA's fixed addressing calculations would have.

UINT VertsPerPrimitive = GetNumVertsBetweenPrimsInCurrentTopology();
        // e.g. VertsPerPrimitive = 3 for tri list
        //                        = 6 for tri list w/adj
        //                        = 1 for tri strip
        //                        = 2 for tri strip w/adj
        //                        = 2 for line list
        //                        = 4 for line list w/adj
        //                        = 1 for line strip
        //                        = 1 for line strip w/adj
        //                        = 1 for point list
UINT VertsPerCompletedPrimitive =
    GetNumVertsUntilFirstCompletedPrimitiveInCurrentTopology();
        // e.g. VertsPerCompletedPrimitive = 3 for tri list
        //                                 = 6 for tri list w/adj
        //                                 = 3 for tri strip
        //                                 = 7 for tri strip w/adj, (not 6) since 1
        //                                   vert is not involved in the prim,
        //                                   when the strip has more than one
        //                                   primitive.
        //                                 = 2 for line list
        //                                 = 4 for line list w/adj
        //                                 = 2 for line strip
        //                                 = 4 for line strip w/adj
        //                                 = 1 for point list

for(UINT InstanceID = 0; InstanceID < InstanceCount; InstanceID++)
{
    UINT PrimitiveID = 0;
    UINT VertsUntilNextCompletePrimitive = VertsPerCompletedPrimitive;
    SetNextInstanceID(InstanceID); // subsequent vertices and primitives
                                   // will get this InstanceID
    for(UINT VertexID = 0; VertexID < VertexCountPerInstance; VertexID++)
    {
        VertsUntilNextCompletePrimitive--;
        if( VertsUntilNextCompletePrimitive == 0 )
        {
            SetNextPrimitiveID(PrimitiveID++);
            VertsUntilNextCompletePrimitive = VertsPerPrimitive;
        }
        SetNextVertexID(VertexID);
    } // vertex loop
    if( IsTriangleStripWithAdjacency() && (VertsUntilNextCompletePrimitive == 1) )
    {
        // When traversing a triangle strip w/ adjacency, after the initial 7
        // vertices, every other vertex completes a primitive, EXCEPT when
        // the end of the strip is reached, where the last 2 consecutive
        // vertices each complete a primitive.
        SetNextPrimitiveID(PrimitiveID++); // in a tristrip w/adj
                                           // the last completed primitive has
                                           // not been counted yet.
    }
} // instance loop
DrawIndexed( UINT IndexCount, UINT StartIndexLocation, INT BaseVertexLocation)
UINT IndexCount | How many indices to read sequentially from the Index Buffer. |
UINT StartIndexLocation | Which Index to start at in the Index Buffer. |
INT BaseVertexLocation | Which Vertex in each buffer marked as Vertex Data to consider as Index "0". Note that this value is signed. A negative BaseVertexLocation allows, for example, the first vertex to be referenced by an index value > 0. |
See the pseudocode for DrawIndexedInstanced(), below. DrawIndexed() behaves the same as DrawIndexedInstanced(), with InstanceCount = 1 and StartInstanceLocation = 0. If "Instance" data has been bound, it will be used. But the intent is for this method to be used without instancing.
DrawIndexedInstanced( UINT IndexCountPerInstance, UINT InstanceCount, UINT StartIndexLocation, INT BaseVertexLocation, UINT StartInstanceLocation)
UINT IndexCountPerInstance | How many indices to read sequentially from the Index Buffer (same set repeated for each Instance). |
UINT InstanceCount | How many Instances to render. |
UINT StartIndexLocation | Which Index to start at in the Index Buffer (for each Instance). |
INT BaseVertexLocation | Which Vertex in each buffer marked as Vertex Data to consider as Index "0". Note that this value is signed. A negative BaseVertexLocation allows, for example, the first vertex to be referenced by an index value > 0. |
UINT StartInstanceLocation | Which Instance to start sequentially fetching from in each Buffer marked as Instance Data. |
UINT VertexBufferElementAddressInBytes[32][32]; // [D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT]
                                                // [D3D11_IA_VERTEX_INPUT_STRUCTURE_ELEMENT_COUNT]
UINT InstanceDataStepCounter[32]; // [D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT]

// Initialize starting Index Buffer address
UINT IndexBufferElementAddressInBytes =
    StartIndexLocation*sizeof(IndexBuffer.Format) + IndexBufferOffsetInBytes;

// Initialize starting Vertex Buffer addresses
// (relevant to Instance Data only, as Vertex Data is traversed with indexing).
for(each slot, s, with a VertexBuffer assigned)
{
    if(Slot[s].Class == D3D11_INPUT_PER_INSTANCE_DATA)
    {
        for(each Element, e, in the Buffer's structure declaration)
        {
            VertexBufferElementAddressInBytes[s][e] =
                Slot[s].VertexBufferOffsetInBytes +
                Slot[s].StrideInBytes*StartInstanceLocation +
                Slot[s].pInputLayout->pElement[e].OffsetInBytes;
        } // Element loop
        InstanceDataStepCounter[s] = Slot[s].InstanceDataStepRate;
    }
} // slot loop

// Now compute addresses and fetch data
// for all elements of each buffer for each vertex
// for each instance.
for(UINT InstanceID = 0; InstanceID < InstanceCount; InstanceID++)
{
    for(UINT i = 0; i < IndexCountPerInstance; i++)
    {
        UINT IndexValue = FetchIndexFromIndexBuffer(IndexBufferElementAddressInBytes,
                                                    IndexBuffer.Format);
        if(GetPredefinedCutIndexValue(IndexBuffer.Format) == IndexValue)
        {
            RestartTopology();
            // Increment the index address
            IndexBufferElementAddressInBytes += sizeof(IndexBuffer.Format);
            // No vertex to fetch for this iteration...
            continue;
        }
        for(each slot, s, with a VertexBuffer assigned)
        {
            UINT IndexedOffset;
            if(Slot[s].Class == D3D11_INPUT_PER_VERTEX_DATA)
            {
                IndexedOffset = Slot[s].StrideInBytes*( BaseVertexLocation + IndexValue);
            }
            for(each Element, e, in the buffer's structure declaration)
            {
                if(Slot[s].Class == D3D11_INPUT_PER_VERTEX_DATA)
                {
                    VertexBufferElementAddressInBytes[s][e] =
                        Slot[s].VertexBufferOffsetInBytes +
                        IndexedOffset +
                        Slot[s].pInputLayout->pElement[e].OffsetInBytes;
                }
                // Fetch this vertex Element's data from Slot[s].pBuffer
                // at address VertexBufferElementAddressInBytes[s][e],
                // with type Slot[s].pInputLayout->pElement[e].Format,
                // and output to the Shader Register identified by
                // Slot[s].pInputLayout->pElement[e].Register,
                // taking account the writemask declared in the shader.
                FetchDataFromMemory(VertexBufferElementAddressInBytes[s][e],s,e);
            } // Element loop
        } // slot loop
        // Increment the index address
        IndexBufferElementAddressInBytes += sizeof(IndexBuffer.Format);
    } // index loop

    // Patch Instance Data addresses at the end of an instance.
    for(each slot, s, with a VertexBuffer assigned)
    {
        if(Slot[s].Class == D3D11_INPUT_PER_INSTANCE_DATA)
        {
            if(1 == InstanceDataStepCounter[s])
            {
                for(each Element, e, in the buffer's structure declaration)
                {
                    VertexBufferElementAddressInBytes[s][e] += Slot[s].StrideInBytes;
                }
                InstanceDataStepCounter[s] = Slot[s].InstanceDataStepRate;
            }
            else if(1 < InstanceDataStepCounter[s])
            {
                InstanceDataStepCounter[s]--;
            }
        }
    } // slot loop
    RestartTopology(); // restart at the end of an instance
} // instance loop
// The following pseudocode for calculating IDs has been separated out from the
// address calculation pseudocode above, for clarity. In practice the
// algorithms would be merged, or possibly be implemented as part of the
// primitive assembly process. Note that VertexID/PrimitiveID/InstanceID
// values are unrelated to address calculations for IA data fetching.
// If desired, applications can choose ID starting values so that IDs can be used in
// Shaders to load data from memory out of similar locations in memory as
// the IA's fixed addressing calculations would have.

UINT VertsPerPrimitive = GetNumVertsBetweenPrimsInCurrentTopology();
// e.g. VertsPerPrimitive = 3 for tri list
//                        = 6 for tri list w/adj
//                        = 1 for tri strip
//                        = 2 for tri strip w/adj
//                        = 2 for line list
//                        = 4 for line list w/adj
//                        = 1 for line strip
//                        = 1 for line strip w/adj
//                        = 1 for point list

UINT VertsPerCompletedPrimitive = GetNumVertsUntilFirstCompletedPrimitiveInCurrentTopology();
// e.g. VertsPerCompletedPrimitive = 3 for tri list
//                                 = 6 for tri list w/adj
//                                 = 3 for tri strip
//                                 = 7 for tri strip w/adj, (not 6) since 1
//                                   vert is not involved in the prim,
//                                   when the strip has more than one
//                                   primitive.
//                                 = 2 for line list
//                                 = 4 for line list w/adj
//                                 = 2 for line strip
//                                 = 4 for line strip w/adj
//                                 = 1 for point list

UINT CutIndexValue = GetPredefinedCutIndexValue(IndexBuffer.Format);
for(UINT InstanceID = 0; InstanceID < InstanceCount; InstanceID++)
{
    UINT PrimitiveID = 0;
    UINT VertsUntilNextCompletePrimitive = VertsPerCompletedPrimitive;
    SetNextInstanceID(InstanceID); // subsequent vertices and primitives
                                   // will get this InstanceID
    for(UINT i = 0; i < IndexCountPerInstance; i++)
    {
        UINT IndexValue = FetchIndexFromIndexBuffer(); // detail hidden
        // IndexValue assignment above: Detail hidden, see full index fetch calculation in
        // DrawIndexedInstanced() pseudocode (which in practice this code would be merged with)
        if(CutIndexValue == IndexValue)
        {
            if( IsTriangleStripWithAdjacency() && (VertsUntilNextCompletePrimitive == 1) )
            {
                // When traversing a triangle strip w/ adjacency, after the initial 7
                // vertices, every other vertex completes a primitive, EXCEPT when
                // the end of the strip is reached, where the last 2 consecutive
                // vertices each complete a primitive.
                SetNextPrimitiveID(PrimitiveID++); // in a tristrip w/adj
                                                   // the last completed primitive has
                                                   // not been counted yet.
            }
            VertsUntilNextCompletePrimitive = VertsPerCompletedPrimitive;
        }
        else
        {
            VertsUntilNextCompletePrimitive--;
            if( VertsUntilNextCompletePrimitive == 0 )
            {
                SetNextPrimitiveID(PrimitiveID++);
                VertsUntilNextCompletePrimitive = VertsPerPrimitive;
            }
            SetNextVertexID(IndexValue);
        }
    } // vertex loop
    if( IsTriangleStripWithAdjacency() && (VertsUntilNextCompletePrimitive == 1) )
    {
        // When traversing a triangle strip w/ adjacency, after the initial 7
        // vertices, every other vertex completes a primitive, EXCEPT when
        // the end of the strip is reached, where the last 2 consecutive
        // vertices each complete a primitive.
        SetNextPrimitiveID(PrimitiveID++); // in a tristrip w/adj
                                           // the last completed primitive has
                                           // not been counted yet.
    }
} // instance loop
DrawInstancedIndirect(
    ID3D11Buffer *pBufferForArgs,
    UINT AlignedByteOffsetForArgs);

struct DrawInstancedIndirectArgs
{
    UINT VertexCountPerInstance;
    UINT InstanceCount;
    UINT StartVertexLocation;
    UINT StartInstanceLocation;
};
ID3D11Buffer *pBufferForArgs | A buffer that contains an array of DrawInstancedIndirectArgs, described in the struct above. |
UINT AlignedByteOffsetForArgs | A DWORD-aligned byte offset for the data. |
UINT VertexCountPerInstance | How many vertices to read sequentially from Buffer(s) marked as Vertex Data (same set repeated for each Instance). |
UINT InstanceCount | How many Instances to render. |
UINT StartVertexLocation | Which Vertex to start at in each Buffer marked as Vertex Data (for each Instance). |
UINT StartInstanceLocation | Which Instance to start sequentially fetching from in each Buffer marked as Instance Data. |
If the address range in the Buffer where DrawInstancedIndirect’s parameters will be fetched from would go out of bounds of the Buffer, behavior is undefined.
Here(18.6.5.1) is a discussion about ways to initialize the arguments for DrawInstancedIndirect.
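For illustration only (this is not the referenced discussion; `device`, `context`, and `argsBuffer` are hypothetical names for objects assumed to already exist), a minimal CPU-side sketch of filling an argument buffer could look like:

```cpp
// Sketch: CPU-initialized argument buffer for DrawInstancedIndirect.
// Assumes #include <d3d11.h> and valid ID3D11Device*/ID3D11DeviceContext* pointers.
UINT args[4] = { 36, 3, 0, 0 }; // VertexCountPerInstance, InstanceCount,
                                // StartVertexLocation, StartInstanceLocation
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = sizeof(args);
desc.Usage = D3D11_USAGE_DEFAULT;
desc.MiscFlags = D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS; // required for indirect args
D3D11_SUBRESOURCE_DATA init = { args, 0, 0 };
ID3D11Buffer* argsBuffer = nullptr;
device->CreateBuffer(&desc, &init, &argsBuffer);
context->DrawInstancedIndirect(argsBuffer, 0); // offset must be DWORD aligned
```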
DrawIndexedInstancedIndirect(
    ID3D11Buffer *pBufferForArgs,
    UINT AlignedByteOffsetForArgs);

struct DrawIndexedInstancedIndirectArgs
{
    UINT IndexCountPerInstance;
    UINT InstanceCount;
    UINT StartIndexLocation;
    INT  BaseVertexLocation;
    UINT StartInstanceLocation;
};
ID3D11Buffer *pBufferForArgs | A buffer that contains an array of DrawIndexedInstancedIndirectArgs, described in the struct above. |
UINT AlignedByteOffsetForArgs | A DWORD-aligned byte offset for the data. |
UINT IndexCountPerInstance | How many indices to read sequentially from the Index Buffer (same set repeated for each Instance). |
UINT InstanceCount | How many Instances to render. |
UINT StartIndexLocation | Which Index to start at in the Index Buffer (for each Instance). |
INT BaseVertexLocation | Which Vertex in each buffer marked as Vertex Data to consider as Index "0". Note that this value is signed. A negative BaseVertexLocation allows, for example, the first vertex to be referenced by an index value > 0. |
UINT StartInstanceLocation | Which Instance to start sequentially fetching from in each Buffer marked as Instance Data. |
If the address range in the Buffer where DrawIndexedInstancedIndirect’s parameters will be fetched from would go out of bounds of the Buffer, behavior is undefined.
Here(18.6.5.1) is a discussion about ways to initialize the arguments for DrawIndexedInstancedIndirect.
DrawAuto is used with StreamOutput(14) in order to use a Stream Output Buffer as an Input Assembler Vertex Input Buffer without requiring the BufferFilledSize to get back to the CPU. The Buffer bound to slot zero must have both the Stream Output and Input Assembler Vertex Input Bind Flags set. When invoked, DrawAuto will draw from the Buffer offset associated with slot zero to the BufferFilledSize(14.4) associated with the Buffer. If the BufferFilledSize is less than or equal to the specified buffer offset, then nothing is drawn. The primitive type for DrawAuto is the current primitive topology set via IASetPrimitiveTopology(8.1.2), regardless of the geometry shader output topology used while the buffer is filled.
Buffers may be bound to other IA input slots above zero for DrawAuto (only the IA bind flag is required on these slots), and these can be part of the Vertex Declaration as well. Reading out of bounds on any Buffer above slot zero in DrawAuto invokes the default behavior for reading out of bounds (as with any other Draw* call).
DrawAuto()
The diagram below defines the vertex ordering for all of the primitive topologies that the IA can produce. The enumeration of primitive topologies is here(8.1.2).
As an example, suppose the IA is asked to draw triangle lists with adjacency, and it is invoked with a vertex count of 36 by a Draw() call. From the diagram it should be apparent that a 36-vertex triangle list with adjacency will result in 6 completed primitives.
An interesting property of all the topologies with adjacency (except line strips) is that they contain exactly double the number of vertices as the equivalent topology without adjacency. Every other vertex represents an "adjacent" vertex.
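In closed form (a minimal sketch; the helper names are illustrative, not spec pseudocode), the completed-primitive count for the two list cases reduces to floor division:

```cpp
#include <cstdint>

// Completed primitives in a list topology; leftover vertices that cannot
// form a whole primitive are silently discarded (see Incomplete Primitives below).
uint32_t PrimCountTriList(uint32_t vertexCount)    { return vertexCount / 3; }
uint32_t PrimCountTriListAdj(uint32_t vertexCount) { return vertexCount / 6; } // 36 -> 6
```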
Not shown in the previous diagram (but part of the same list) are 32 additional topologies which represent 1...32 control point patches, respectively. These Patch topologies can be used with Tessellation(11). Also, when Tessellation is disabled(11.8) (meaning no Hull Shader and no Domain Shader bound), they can be fed to the Geometry Shader and/or Stream Output, allowing patch data to be saved to memory, and allowing non-traditional primitive types to be fed to the GS (such as simulating cubes using 8 control point patches to represent 8 vertices).
In Indexed rendering of strip topologies, the maximum representable index value in the index format (i.e. 0xffffffff for 32-bit indices) means the strip defined up to the previous index is to be completed, and the next index is a new strip. This special "cut" value is not required to be used, in which case a DrawIndexed*() command will simply draw one strip. In IndexedInstanced rendering, there is an automatic "cut" after every instance. Regardless of Instanced rendering or not, it is optional whether to make the last index the cut value, or omit the value; both result in the same behavior, except that the IndexCount[PerInstance] parameter to DrawIndexed[Instanced]() is different by 1.
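As a small illustration (hypothetical data; 0xFFFF is the predefined cut value for a 16-bit index format):

```cpp
#include <cstdint>

// Index stream for a triangle strip topology containing one cut.
const uint16_t indices[] = {
    0, 1, 2, 3, // first strip: two triangles
    0xFFFF,     // cut: completes the first strip; the next index starts a new strip
    4, 5, 6     // second strip: one triangle
};
// DrawIndexed(8, 0, 0) over this stream assembles 3 triangles in two strips.
```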
Even if the current Primitive Topology is not a strip, the cut index value still takes effect, potentially resulting in an incomplete primitive (see the next section). Thus, handling of the cut is kept orthogonal to primitive topology, even though it is not useful for some topologies.
Note that providing a behavior for the cut value when used with a non-strip topology is a way of saying that the behavior is defined, allowing hardware to keep the cut behavior always enabled. In practice though, using cut for a list topology is obviously not a "feature" that it would ever make sense for an application to author to.
Each Draw*() call starts a new Primitive Topology; there is no persistence of any topology produced by a previous Draw() call. Triangle strips don't continue across Draw() call boundaries.
If a Draw*() call produces incomplete primitives (not enough vertices), either at the end of the Draw*() call, or anywhere in the middle (possible with the "cut" index), any incomplete primitives are silently discarded. For example, suppose a Draw*() call is made with triangle list as the topology, and a vertex count of 5. This case would result in a single triangle, with the last 2 vertices being silently discarded. For another example showing handling of an incomplete primitive, see the diagram under the Geometry Shader Stage here(13.10), depicting which primitives are instantiated given a triangle strip with adjacency that has a dangling vertex.
For the purpose of assigning constant vertex attributes to primitives, there must be a way to map a vertex to a primitive. Let us identify the vertex in a primitive which supplies its per-primitive constant data as the "leading vertex". A primitive topology can have multiple leading vertices, one for each primitive in the topology. The leading vertex for an individual primitive in a topology is the first non-adjacent vertex in the primitive. For the triangle strip with adjacency above, the leading vertices are 0, 2, 4, 6, etc. For the line strip with adjacency, the leading vertices are 1, 2, 3 etc.
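In closed form, for the two examples just given (an illustrative sketch; the function names are not from the spec):

```cpp
#include <cstdint>

// Leading vertex (first non-adjacent vertex) of primitive p in the topology.
uint32_t LeadingVertexTriStripAdj(uint32_t p)  { return 2 * p; } // 0, 2, 4, 6, ...
uint32_t LeadingVertexLineStripAdj(uint32_t p) { return p + 1; } // 1, 2, 3, ...
```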
Note that adjacent primitives have no leading vertex. This means that there is no primitive data associated with adjacent primitives. With the strip topologies, a given interior primitive has some adjacent primitives which are also interior to the topology, and so can actually have primitive data. However, as far as the Geometry Shader is concerned (it sees a single primitive and its neighboring primitives in an invocation), only the single interior primitive defining the Geometry Shader invocation can have Primitive Data; adjacent primitives, whether they are interior to the strip or adjacent primitives on the strip, never come with Primitive Data.
The only place in the Pipeline where adjacency information is visible to the application is in the Geometry Shader. Each invocation of the Geometry Shader sees a single primitive: a point, line, or triangle, and some of these might include adjacent vertices.
The inputs to the Geometry Shader are like a single primitive of any of the "list" primitive topologies (with or without adjacency) in the diagram above. When adjacency is available, the Geometry Shader will see the appropriate adjacent vertices along with the primitive's vertices. So for example if the Geometry Shader is invoked with a triangle including adjacency (the source could have been a strip with adjacency), this would mean that data for 6 vertices would be available as input in the Geometry Shader: 3 vertices for the triangle, and 3 for the adjacency.
The data layout for adjacent vertices is identical to the standard vertices they accompany. Note that Vertex Shaders are always run on all vertices, including adjacent vertices. Note that adjacent vertices are typically also surface vertices of some other primitive that will get drawn, so the Vertex Shader result cache can take advantage of this.
When the IA is instructed to produce a primitive topology with adjacency for its output, all adjacent vertices must be specified. There is no concept of handling edges with no adjacent primitive. The application must deal with this on its own, perhaps by providing a dummy vertex (possibly forming a degenerate triangle), or perhaps by flagging in one of the vertex attributes whether the vertex "exists" or not. The application's Geometry Shader code will have to detect this situation, if desired, and deal with it manually. Implied in this is that there must be no culling of degenerate primitives until rasterizer setup, so that the Geometry Shader is guaranteed to see all geometry.
Note that when Tessellation is enabled, topologies with adjacency cannot be used. The Tessellator operates a patch at a time without hardware knowledge about adjacency (although shader code is free to encode it on its own). The Tessellator's outputs are independent primitives, with no adjacency information.
VertexID is a 32-bit unsigned integer scalar counter value coming out of Draw*() calls identifying to Shaders each vertex. This value can be declared(22.3.11) for input by the Vertex Shader only.
For Draw() and DrawInstanced(), VertexID starts at 0, and it increments for every vertex. Across instances in DrawInstanced(), the count resets back to the start value. Should the 32-bit VertexID calculation overflow, it simply wraps.
For DrawIndexed() and DrawIndexedInstanced(), VertexID represents the index value.
The mere presence of VertexID in a Vertex Shaders' input declarations activates the feature (there is no other control outside the shader). If the application wishes to pass this data to later Pipeline stages, the application can do so by simply writing the value to a Shader output register like any other data.
For Primitive Topologies with adjacency, such as a triangle strip w/adjacency, the "adjacent" vertices have a VertexID associated with them just like the "non-adjacent" vertices do, all generated uniformly (i.e. without regard to which vertices are adjacent and which are not in the topology).
For more information, see the general discussion of System Generated Values here(4.4.4), the reference for VertexID here(23.1), and the System Interpreted/Generated Value input(22.3.11) declaration for Shaders.
PrimitiveID is a 32-bit unsigned integer scalar counter value coming out of Draw*() calls identifying to Shaders each primitive. This value can be declared(22.3.11) for input by the Hull Shader, Domain Shader, Geometry Shader or Pixel Shader Stage. If the Geometry Shader is active, the hardware-generated PrimitiveID is delivered to the GS, and the PS can only receive a PrimitiveID that the GS computes and writes to its outputs.
PrimitiveID starts at 0 for the first primitive generated by a Draw*() call, and increments for each subsequent primitive. When Draw*Instanced() is used, the PrimitiveID resets to its starting value whenever a new instance begins in the set of instances produced by the call. Should the 32-bit PrimitiveID calculation overflow, it simply wraps.
The mere presence of PrimitiveID in a compatible Shader Stage's input declarations activates the feature (there is no other control outside the shader). In the Geometry Shader this is declared as the special register vPrim (to decouple the value from the other per-vertex inputs). If the application wishes to pass PrimitiveID to a later Pipeline stage, the application can do so by simply writing the value to a Shader output register like any other data. The Pixel Shader does not have a separate input for PrimitiveID; it just goes into a component of a normal input register, with the requirement that the interpolation mode on the entire input register (which may contain other data as well in other components) is chosen as "constant".
For Primitive Topologies(8.10) with adjacency, such as a triangle strip w/adjacency, the PrimitiveID is only maintained for the interior primitives in the topology (the non-adjacent primitives), just like the set of primitives in a triangle strip without adjacency. No point in the Pipeline has a way of asking for an auto-generated PrimitiveID for adjacent primitives.
For more information, see the general discussion of System Generated Values here(4.4.4), the reference for PrimitiveID here(23.2), and the System Interpreted/Generated Value input(22.3.11) and output(22.3.33) declarations for Shaders.
InstanceID is a 32-bit unsigned integer scalar counter value coming out of Draw*() calls identifying to Shaders which instance is being drawn. This value can be declared(22.3.11) for input by the Vertex Shader only.
InstanceID starts at 0 for the first instance of vertices generated by a Draw*() call. If the Draw is a Draw*Instanced() call, after each instance of vertices, the InstanceID increments by one. If the Draw is not a Draw*Instanced() call, then InstanceID never changes. Should the 32-bit InstanceID calculation overflow, it simply wraps.
The mere presence of InstanceID in the Vertex Shader's input declarations activates the feature (there is no other control outside the shader). If the application wishes to pass this data to later Pipeline stages, the application can do so by simply writing the value to a Shader output register like any other data.
For more information, see the general discussion of System Generated Values here(4.4.4), the reference for InstanceID here(23.3), and the System Interpreted/Generated Value input(22.3.11) declaration for Shaders.
Section Contents
(back to chapter)
8.19.1 Input Assembler Arithmetic Precision
8.19.2 Addressing Bounds
8.19.3 Buffer and Structure Offsets and Strides
8.19.4 Reusing Input Resources
8.19.5 Fetching Data in the IA vs. Fetching Later (i.e. Multiple Ways to Do the Same Thing)
The Input Assembler performs 32-bit unsigned integer arithmetic, conforming to the IA addressing pseudocode shown in this spec. In other words, should any calculation overflow 32 bits, it wraps; and should that result happen to fall back into a valid range for the scenario, so be it. Wherever input parameters are listed as signed integers (such as BaseVertexLocation in DrawIndexed()(8.5)), they are interpreted, unaltered, as unsigned 32-bit numbers, used in unsigned 32-bit addressing arithmetic, producing unsigned 32-bit results.
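A minimal sketch of that rule, modeling the indexed per-vertex address calculation from the pseudocode above (the function name is illustrative):

```cpp
#include <cstdint>

// All terms wrap modulo 2^32. A negative BaseVertexLocation is reinterpreted
// as a large UINT; the wrap cancels out as long as
// (BaseVertexLocation + IndexValue) lands back in a valid vertex range.
uint32_t VertexElementAddress(uint32_t vbOffsetInBytes, uint32_t strideInBytes,
                              int32_t baseVertexLocation, uint32_t indexValue,
                              uint32_t elementOffsetInBytes)
{
    uint32_t base = static_cast<uint32_t>(baseVertexLocation); // bits unaltered
    return vbOffsetInBytes + strideInBytes * (base + indexValue) + elementOffsetInBytes;
}
// e.g. baseVertexLocation = -2 with indexValue = 5 addresses vertex 3.
```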
An individual Draw*() call is limited to producing a finite number of vertices. This limit includes any instancing that is occurring within the Draw*() call. Independent of such a limit, there are also limits on how big various source data buffers can be. All of these (large) numbers can be found within the table(21) in the Limits On Various System Resource section. These numbers are expected to be reasonable for the foreseeable lifetime of D3D11.3.
Any calculated address that would fall out of bounds for a Buffer being accessed results in out-of-bounds behavior being invoked, where the return is 0 in all non-missing components of the format (defined in the Input Layout), and the default for missing components (see Defaults for Missing Components(19.1.3.3)). This out-of-bounds behavior applies, for example, when an index refers to a vertex number that is outside of the bound vertex buffer.
The minimum extent for the bounds is any initial offset applied on the Buffer (so "negative" indexing isn't a feature).
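A sketch of the resulting bounds test (the helper name is illustrative):

```cpp
#include <cstdint>

// A fetch of elementSize bytes at addr is in bounds only within
// [bindOffset, bufferSize); otherwise the out-of-bounds rule above applies
// (0 for present components, defaults for missing components).
bool IAFetchInBounds(uint32_t addr, uint32_t elementSize,
                     uint32_t bindOffset, uint32_t bufferSize)
{
    return addr >= bindOffset &&
           addr <= bufferSize &&
           elementSize <= bufferSize - addr;
}
```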
See the Element Alignment(4.4.6) section.
It is perfectly legal to read any given memory Buffer in multiple places in the Pipeline, including the IA, simultaneously, even applying different interpretations to the data in the Buffer. A single Buffer can even be set as input at multiple slots at a single stage such as the IA.
For example, suppose an application has a Vertex Shader that requires 2 different sets of input texture coordinates. One scenario could be to use 2 different input Buffers to provide the different texture coordinates to be fetched by the IA (or both texture coordinates could be interleaved in one Buffer). But an alternate, equally valid scenario is to reuse the same source data to supply both texture coordinates to what the Vertex Shader expects as two different sets. This is simply a matter of binding the same input Buffer to two different input slots.
Another way to achieve the same effect of reusing a single set of data is to bind the source texture coordinate Buffer to a single slot and provide a data declaration where the definition of 2 different texture coordinates overlaps (same structure offset). Partial-overlapping of types in a data declaration is even permitted (even though it isn't interesting); the point is that D3D11.3 doesn't care or bother to check.
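Concretely, an input layout along those lines could look like the following sketch (the semantic names, formats, and offsets are illustrative):

```cpp
#include <d3d11.h>

// Both TEXCOORD elements alias the same bytes in slot 0: each reads the
// float2 at byte offset 12 of every vertex.
const D3D11_INPUT_ELEMENT_DESC layout[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT,    0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 1, DXGI_FORMAT_R32G32_FLOAT,    0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};
```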
Similarly, the structure stride in a vertex declaration can be any non-negative value (up to a maximum of 2048 Bytes, and conforming to alignment(4.4.6) rules), without regard to whether it is large enough to support the fields defined for the structure. Again, the point is that D3D11.3 doesn't care or bother to check. Debug tools can be provided to optionally enforce well-ordered, logical data layouts; however, the arithmetic that the underlying hardware uses to actually address data simply blindly follows the intent shown by the pseudocode for address calculations in the Draw*()(8.2) routines.
It is legal to have a single Buffer containing both vertex data and index data, and thus bind the Buffer at both a Vertex Buffer input slot and as an Index Buffer simultaneously. One might store indices at the beginning of the Buffer and the vertex data being referred to elsewhere in the same Buffer. D3D11.3 doesn't care.
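A binding sketch of that scenario (assuming a hypothetical `buffer` created with both D3D11_BIND_VERTEX_BUFFER and D3D11_BIND_INDEX_BUFFER, and a valid `context`; the offsets are illustrative):

```cpp
// Indices live at the start of the buffer; vertex data lives further in.
UINT stride = 24, vbOffsetInBytes = 4096;
context->IASetVertexBuffers(0, 1, &buffer, &stride, &vbOffsetInBytes);
context->IASetIndexBuffer(buffer, DXGI_FORMAT_R16_UINT, 0);
```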
As yet another, final (contrived) example, to drive the point home: Suppose a Vertex Buffer is set as input to the IA to provide data for vertices going to the Vertex Shader (as usual). Simultaneously, the same Vertex Buffer may be accessed directly by the Vertex Shader, if for some reason the Shader occasionally wanted to look at some of the input data for vertices other than itself.
The highly flexible and programmable nature of the D3D11.3 Pipeline leads to many situations where there are multiple ways to accomplish a single task. A particular example relevant to this section is that the fetching of vertex data performed by the IA can be performed identically by doing memory fetches from the Vertex Shader, given only a VertexID as input. This has nice properties: even though the amount of data the IA can pre-fetch for a single vertex is limited in size, memory fetches from shaders allow much larger amounts of vertex data to be fetched if necessary. Memory fetches from shaders can also use much more complex addressing arithmetic than the common-case dedicated fixed-function arithmetic used by the IA.
No guarantees or requirements are made by D3D11.3, however, as to the performance characteristics of using alternative mechanisms to perform a task that can be performed by an explicit feature intended for that task in the Pipeline. As a general rule, whenever D3D11.3 provides an explicit mechanism for a task, IHVs and ISVs should assume that the dedicated functionality is the preferred route, at least when all or most of the other parts of the graphics Pipeline are simultaneously active.
When the Input Assembler reads Elements of data from Buffers, the data gets converted to the appropriate 32-bit data type for the Format(19.1) interpretation specified. The conversion uses the Data Conversion(3.2) rules. If the source data contains 32-bit per-component float, UINT or SINT data, it is read without modifying the bits at all (no conversion).
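For one common case (a sketch of the rule, not the full Data Conversion specification):

```cpp
#include <cstdint>

// UNORM-to-float expansion for a single component of an R8G8B8A8_UNORM
// element: value / (2^8 - 1). By contrast, 32-bit float/UINT/SINT source
// data is passed through bit-for-bit.
float UnormToFloat(uint8_t b) { return static_cast<float>(b) / 255.0f; }
```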
If a Vertex Buffer or Index Buffer is read by the Input Assembler, but the slot being read has no Buffer bound, the result of the read is 0 for all expected components. Even though there is format information available via the input layout, defaults are not applied to missing channels for this case.
The following example shows DrawIndexedInstanced()(8.6) being used to draw 3 instances of an indexed mesh.
The example does not attempt to draw anything particularly interesting, but it does show most of the functionality of the IA being used at once, in complete detail. Included is a depiction of the resulting workload for the rest of the Graphics Pipeline.
As input, one Vertex Buffer supplies Vertex Data, another Vertex Buffer supplies Instance Data, and there is an Index Buffer. The data layouts and configuration of all of these buffers are illustrated. VertexID(8.16), PrimitiveID(8.17) and InstanceID(8.18) are all shown as well, assuming Shaders in the pipeline requested them. The Primitive Topology(8.10) being rendered is triangle strip with adjacency. The Index Buffer has a Cut(8.12) in it, so multiple strips are produced (per instance).
Various states shown in boxes represent the API settings for Buffers and for the IA states described earlier in this IA spec.
Chapter Contents
(back to top)
9.1 Vertex Shader Instruction Set
9.2 Vertex Shader Invocation
9.3 Vertex Shader Inputs
9.4 Vertex Shader Output
9.5 Registers
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
The Vertex Shader instruction set is listed here(22.1.3).
For every vertex generated by the IA, the Vertex Shader is invoked, provided that there is a miss on the hardware's Vertex Shader result cache. Adjacent vertices are treated equivalently to interior vertices in a topology, so the Vertex Shader is executed for all vertices.
The primary inputs to a Vertex Shader invocation are 32 32-bit*4-component registers (v#) comprising the elements of the input vertex (not all have to be used). ConstantBuffers (cb#) and textures (t#) provide random access input to Vertex Shaders.
The output of a Vertex Shader is up to 32 32-bit*4 component registers (o#). The o# registers to be written by the Shader must be declared (i.e. "dcl_output o[3].xyz").
The following registers are available in the vs_5_0 model:
Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL |
---|---|---|---|---|---|---|
32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | n | none | y |
32-bit Indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | y | none | y |
32-bit Input (v#) | 32 | r | 4 | y | none | y |
Element in an input resource (t#) | 128 | r | 1 | n | none | y |
Sampler (s#) | 16 | r | 1 | n | none | y |
ConstantBuffer reference (cb#[index]) | 15 | r | 4 | y(contents) | none | y |
Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | y(contents) | none | y |
Output Registers: | ||||||
NULL (discard result, useful for ops with multiple results) | n/a | w | n/a | n/a | n/a | n |
32-bit output Vertex Data Element (o#) | 32 | w | 4 | n/a | n/a | y |
Unordered Access View (u#) | 64 | r/w | 1 | n | n | y |
Chapter Contents
(back to top)
10.1 Hull Shader Instruction Set
10.2 Hull Shader Invocation
10.3 HS State Declarations
10.4 HS Control Point Phase
10.5 HS Patch Constant Phases
10.6 Hull Shader Structure Summary
10.7 Hull Shader Control Point Phase Contents
10.8 Hull Shader Fork Phase Contents
10.9 Hull Shader Join Phase Contents
10.10 Hull Shader Tessellation Factor Output
10.11 Restrictions on Patch Constant Data
10.12 Shader IL "Ret" Instruction Behavior in Hull Shader
10.13 Hull Shader MaxTessFactor Declaration
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
For a Tessellation overview, see the Tessellator(11) section.
The Hull Shader instruction set is listed here(22.1.4).
The Hull Shader operates once per patch, transforming Control Points, computing Patch Constant data and defining Tessellation Factors.
The Hull Shader has four phases, all defined together as one program. That is, from the API/DDI point of view, the Hull Shader is a single atomic shader, and its phases are an implementation detail within the Hull Shader program. Implementations can choose to exploit independent work within a Patch by executing work within a single patch in parallel.
The phases appear in the Intermediate Language as standalone shaders, each with individual input and output declarations tailored to what each independent program is doing. However the inputs and outputs across all of the shaders come out of a fixed pool of Hull Shader-wide input data and output storage, described later in great detail.
The Hull Shader phase structure is depicted in the following picture:
This section of the Hull Shader has no executable code. It simply declares some overall characteristics of Hull Shader operation, such as how many control points the HS inputs and outputs (an independent number). The operation of the fixed function Tessellator is also defined here – such as choosing the patch domain, partitioning etc. A tessellation pattern overview is given here(11.7).
Note that declarations that are typical in shaders, such as input and output register declarations and declarations of input Resources, Constant Buffers, Samplers etc. are part of each individual shader phase below, not part of this HS State declaration section.
See Tessellator State(11.7.15).
In the Hull Shader’s Control Point phase, a thread is invoked once per patch output control point. An input value vOutputControlPointID(23.7) identifies to each thread which output control point it represents. Each of the threads sees shared input of all the input control points for the patch. The output of each thread is one of the output control points for the patch.
Section Contents
(back to chapter)
10.5.1 Overview
10.5.2 HS Patch Constant Fork Phase
10.5.3 HS Patch Constant Join Phase
The Patch Constant phases compute constant data such as Tessellation Factors(10.10) (how much the fixed function Tessellator should tessellate), as well as any other Patch Constant data, beyond the patch Control Points, that the application may need in the Domain Shader(12) (the shader that runs once per Tessellator output point).
The Patch Constant phases occur after the Control Point phase is complete, and have read-only access to all of the input and output Control Points. So, for example, Control Points could be examined to help calculate Tessellation Factors(10.10) for each patch edge.
There are two Patch Constant phases:
The Patch Constant Fork Phase is a collection of an arbitrary number of independent programs. For the discussion in this section let us call these independent programs mini-shaders.
Each mini-shader produces independent (non-overlapping) parts of the total output Patch Constant data (such as all the different TessFactors(10.10)).
An implementation could choose to execute each mini-shader in parallel, since they are independent. Or, in the opposite extreme an implementation could choose to trivially concatenate all the mini-shaders together and run them serially. Such transformations of the mini-shaders are trivial to perform (in a driver’s compiler) given they all share the same inputs and perform non-overlapping writes to a unified output space.
An implementation could even choose to hoist any amount of the code from the Fork Phase up into the Control Point Phase if that happened to be most efficient. This is allowable because all the parts of a Hull Shader are specified together as if they were one program – how its contents are executed does not matter as long as the output is deterministic.
The shared inputs to each mini-shader are all of the Control Point Phase’s Input and Output Control Points.
The output of each mini-shader is a non-overlapping subset of the output Patch Constant Data.
There is no communication of data between mini-shaders, other than the fact that they share Control Point input.
To further enable parallelism within a single mini-shader, any mini-shader can be declared to run in an instanced fashion, given a fixed instance count per patch. During execution, each instance of an instanced mini-shader is identified by a ForkInstanceID(23.8) and is responsible for producing a unique output, typically by indexing an array of outputs. So, for example, a single mini-shader instanced 4 times could output edge TessFactors for each edge of a quad patch.
The final Hull Shader phase is the Patch Constant Join Phase. This phase behaves the same way as the Fork Phase, in that there can be multiple Join programs that are independent of each other. All of them execute after all the Fork Phase programs. An example use for this phase is to derive TessFactors(10.10) for the inside of a patch given the edge TessFactors computed in the previous phase.
The inputs to each Patch Constant Join Phase shader are all of the Control Point Phase's Input and Output Control Points, as well as all of the Patch Constant Fork Phase's output.
The output of each Patch Constant Join Phase shader is a subset of the output Patch Constant data that does not overlap any of the outputs of the shaders from the Patch Constant Fork Phase or other Join Phase shaders.
Similar to the Fork Phase, to enable parallelism within a Join Phase mini-shader, any mini-shader can be declared to run in an instanced fashion, given a fixed instance count per patch. During execution, each instance of an instanced mini-shader is identified by a JoinInstanceID(23.9) and is responsible for producing a unique output, typically by indexing an array of outputs. So, for example, a single mini-shader instanced 2 times could output inside TessFactors for each inside direction of a quad patch.
The various phases of the Hull Shader are described in the Intermediate Language as separate shader models. A single Hull Shader program consists of a collection of the following shaders appearing in the order listed here:
hs_decls(22.3.14): Hull Shader State Declarations
hs_control_point_phase(22.3.21): Hull Shader Control Point Phase
hs_fork_phase(22.3.23): Hull Shader Patch Constant Fork Phase
hs_join_phase(22.3.26): Hull Shader Patch Constant Join Phase
From the point of view of the HLSL code author and API user, the name for the Hull Shader compiler target is simply hs_5_0.
hs_control_point_phase(22.3.21) is a shader program with the following register model. Note the footnotes which provide a detailed discussion of output storage size calculations.
Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL |
---|---|---|---|---|---|---|
32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | N | None | Y |
32-bit indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | Y | None | Y |
32-bit Input (v[vertex][element]) | 32(element)*32(vert) | r | 4 | Y | None | Y |
32-bit UINT Input vOutputControlPointID(23.7) | 1 | r | 1 | N | None | Y |
32-bit UINT Input PrimitiveID (vPrim) | 1 | r | 1 | N | n/a | Y |
Element in an input resource (t#) | 128 | r | 128 | Y | None | Y |
Sampler (s#) | 16 | r | 1 | Y | None | Y |
ConstantBuffer reference (cb#[index]) | 15 | r | 4 | Y | None | Y |
Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | Y(contents) | None | Y |
Output Registers: | ||||||
32-bit output Vertex Data Element (o#) | 32, see (1) below | w | 4 | Y | None | Y |
(1) Each Hull Shader Control Point Phase output register is up to a 4-vector, of which up to 32 registers can be declared. There are also from 1 to 32 output control points declared, which scales the amount of storage required. Let us refer to the maximum allowable aggregate number of scalars across all Hull Shader Control Point Phase output as #cp_output_max.
#cp_output_max = 3968 scalars
This limit happens to be based on a design point for certain hardware of 4096*32-bit storage here. The amount for Control Point output is 3968=4096-128, which is 32(control points)*4(component)*32(elements) - 4(component)*32(elements). The subtraction reserves 128 scalars (one control point) worth of space dedicated to the HS Phase 2 and 3, discussed below. The choice of reserving 128 scalars for Patch Constants (as opposed to allowing the amount to be simply whatever of the 4096 scalars of storage is unused by output Control Points) accommodates the limits of another particular hardware design. Note the Control Point Phase can declare 32 output control points, but they just can’t be fully 32 elements with 4 components each, since the total storage would be too high.
InstanceID(8.18) and VertexID(8.16) can be input as long as the previous Vertex Shader stage outputs them.
PrimitiveID(8.17) is also available as a scalar 32-bit integer input for each Control Point. PrimitiveID indicates the current patch in the Draw*() call, starting with 0. This PrimitiveID is the same value that the Geometry Shader would see for every patch if it input PrimitiveID - that is, every point/line/triangle produced by the Tessellator for a given patch shares the single PrimitiveID for the entire Patch.
OutputControlPointID(23.7) is a scalar 32-bit integer input for each Control Point identifying which one it is [0..n-1] given n declared output Control Points.
Section Contents
(back to chapter)
10.8.1 HS Fork Phase Programs
10.8.2 HS Fork Phase Registers
10.8.3 HS Fork Phase Declarations
10.8.4 Instancing of an HS Fork Phase Program
10.8.5 System Generated Values in the HS Fork Phase
There can be 0 or more Fork Phase programs present in a Hull Shader. Each of them declares its own inputs, but they come from the same pool of input data – the Control Points. Each Fork Phase program declares its own outputs as well, but out of the same output register space as all Fork Phase and Join Phase programs, and the outputs can never overlap.
The following registers are visible in the hs_fork_phase(22.3.23) model.
The input resources (t#), samplers (s#), constant buffers (cb#) and immediate constant buffer (icb) below are all shared state with all other HS Phases. That is, from the API/DDI point of view, the Hull Shader has a single set of input resource state for all phases. This goes with the fact that from the API/DDI point of view, the Hull Shader is a single atomic shader; the phases within it are implementation details.
Note the footnotes which provide a detailed discussion of output storage size calculations.
Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL |
---|---|---|---|---|---|---|
32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | N | None | Y |
32-bit indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | Y | None | Y |
32-bit Input Control Points (vicp[vertex][element]) (pre-Control Point Phase) | 32, see (1) below | r | 4(component)*32(element)*32(vert) | Y | None | Y |
32-bit Output Control Points (vocp[vertex][element]) (post-Control Point Phase) | 32, see (1) below | r | 4(component)*32(element)*32(vert) | Y | None | Y |
32-bit UINT Input PrimitiveID (vPrim) | 1 | r | 1 | N | n/a | Y |
32-bit UINT Input ForkInstanceID(23.8) (vForkInstanceID) | 1 | r | 1 | N | n/a | Y |
Element in an input resource (t#) | 128 | r | 128 | Y | None | Y |
Sampler (s#) | 16 | r | 1 | Y | None | Y |
ConstantBuffer reference (cb#[index]) | 15 | r | 4 | Y | None | Y |
Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | Y(contents) | None | Y |
Output Registers: | ||||||
32-bit output Patch Constant Data Element (o#) | 32, see (2) below | w | 4 | Y | None | Y |
(1) The HS Fork Phase’s Input Control Point register (vicp) declarations must be any subset, along the [element] axis, of the HS Control Point input (pre-Control Point phase). Similarly the declarations for inputting the Output Control Points (vocp) must be any subset, along the [element] axis, of the HS Output Control Points (post-Control Point Phase).
Along the [vertex] axis, the number of control points to be read for each of the vicp and vocp must similarly be a subset of the HS Input Control Point count and HS Output Control Point count, respectively. For example, if the vertex axis of the vocp registers are declared with n vertices, that makes the Control Point Phase’s Output Control Points [0..n-1] available as read only input to the Fork Phase.
(2) The HS Fork and Join phase outputs are a shared set of 32 4-vector registers. The outputs of each Fork/Join phase program cannot overlap with each other. System Interpreted Values such as TessFactors(10.10) come out of this space.
The declarations for inputs, outputs, temp registers, resource etc. in an HS Fork Phase program are like any standalone shader. A given HS Fork Phase program need only declare what it needs to read and write. Further, if it does not need to see all Input or Output Control Points, it can declare a subset of the counts for each, by declaring a smaller number on the [vertex] array axis than the corresponding number of Control Points actually available.
There is not a way to declare that a sparse set of the Control Points is read. E.g. a shader that needs to read Input Control Points [0], [3], [11] and [15] would just declare the Input Control Point (vicp) register's [vertex] axis size as 16. Note that if references to the Control Points from shader code use static indexing, it will be obvious to drivers exactly what subset of Control Points is actually needed by the program anyway.
Any individual HS Fork Phase program can be declared to execute instanced, with a declaration identifying a fixed instance count from 1 to 128 (128 is the maximum number of scalar Patch Constant outputs). The HS Fork Phase program executes the declared number of times per patch, with each instance identified by its 32-bit UINT input register vForkInstanceID(23.8).
Note that if the role of an instanced Fork Phase program is for each instance to produce a System Interpreted Value(4.4.5), say one of the edge TessFactors(10.10) for a quad patch per instance, the declarations for each of those outputs would identify the System Interpreted Value being produced, just like any other shader.
The HS Fork Phase can input PrimitiveID(8.17) in its own register just like the HS Control Point Phase. The value in this register is the same as what the HS Control Point Phase sees. The other special input register in the HS Fork Phase is vForkInstanceID(23.8), described previously.
The system doesn’t go out of its way to automatically provide other System Generated Values(4.4.4) (VertexID(8.16), InstanceID(8.18)) to the HS Fork Phase. Values like these are part of the Input Control Points (if they were declared to be there) already, so the HS Fork phase can read VertexID/InstanceID by reading them out of the Input Control Points.
The treatment of InstanceID(8.18) does seem strange, in that InstanceID would be the same for all Control Points in a Patch (indeed, unchanging across multiple patches), yet it shows up per-Input Control Point. However, this is consistent with the behavior elsewhere in the pipeline, where the first active stage that can input a System Generated Value (for InstanceID, that is the Vertex Shader) is responsible for passing the value down to the next stage via shader output (rather than the hardware feeding the value down to subsequent stages separately). For the Geometry Shader to see InstanceID, it also shows up in each input vertex there, just like it shows up in each Input Control Point in the Hull Shader.
Section Contents
(back to chapter)
10.9.1 HS Join Phase Program
10.9.2 HS Join Phase Registers
10.9.3 HS Join Phase Declarations
10.9.4 Instancing of an HS Join Phase Program
10.9.5 System Generated Values in the HS Join Phase
There can be 0 or more Join Phase programs present in a Hull Shader. Each of them declares its own inputs, but they come from the same pool of input data – the Control Points as well as the Patch Constant outputs of the Fork Phase programs. Each Join Phase program declares its own outputs as well, but out of the same output register space as all Fork Phase and Join Phase programs, and the outputs can never overlap.
The following registers are visible in the hs_join_phase(22.3.26) model. Note there are three sets of input registers: vicp (Control Point Phase Input Control Points), vocp (Control Point Phase Output Control Points), and vpc (Patch Constants). vpc are the aggregate output of all the HS Fork Phase program(s). The HS Join Phase output o# registers are in the same register space as the HS Fork Phase outputs.
The input resources (t#), samplers (s#), constant buffers (cb#) and immediate constant buffer (icb) below are all shared state with all other HS Phases. That is, from the API/DDI point of view, the Hull Shader has a single set of input resource state for all phases. This goes with the fact that from the API/DDI point of view, the Hull Shader is a single atomic shader; the phases within it are implementation details.
Note the footnotes which provide a detailed discussion of output storage size calculations.
Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL |
---|---|---|---|---|---|---|
32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | N | None | Y |
32-bit indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | Y | None | Y |
32-bit Input Control Points (vicp[vertex][element]) (pre-Control Point Phase) | 32, see (1) below | r | 4(component)*32(element)*32(vert) | Y | None | Y |
32-bit Output Control Points (vocp[vertex][element]) (post-Control Point Phase) | 32, see (1) below | r | 4(component)*32(element)*32(vert) | Y | None | Y |
32-bit Input (vpc[element]) (Patch Constant Data) | 32, see (3) below | r | 4 | Y | None | Y |
32-bit UINT Input PrimitiveID (vPrim) | 1 | r | 1 | N | n/a | Y |
32-bit UINT Input JoinInstanceID(23.9) (vJoinInstanceID) | 1 | r | 1 | N | n/a | Y |
Element in an input resource (t#) | 128 | r | 128 | Y | None | Y |
Sampler (s#) | 16 | r | 1 | Y | None | Y |
ConstantBuffer reference (cb#[index]) | 15 | r | 4 | Y | None | Y |
Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | Y(contents) | None | Y |
Output Registers: | ||||||
32-bit output Patch Constant Data Element (o#) | 32, see (2) below | w | 4 | Y | None | Y |
(1) The HS Join Phase’s Input Control Point register (vicp) declarations must be any subset, along the [element] axis, of the HS Control Point input (pre-Control Point phase). Similarly the declarations for inputting the Output Control Points (vocp) must be any subset, along the [element] axis, of the HS Output Control Points (post-Control Point Phase).
Along the [vertex] axis, the number of control points to be read for each of the vicp and vocp must similarly be a subset of the HS Input Control Point count and HS Output Control Point count, respectively. For example, if the vertex axis of the vocp registers are declared with n vertices, that makes the Control Point Phase’s Output Control Points [0..n-1] available as read only input to the Join Phase.
(2) The HS Fork and Join phase outputs are a shared set of 32 4-vector registers. The outputs of each Fork/Join phase program cannot overlap with each other. System Interpreted Values such as TessFactors(10.10) come out of this space.
(3) In addition to Control Point input, the HS Join Phase also sees as input the Patch Constant data computed by the HS Fork Phase program(s). This shows up at the HS Join Phase as the vpc# registers. The HS Join Phase's input vpc# registers share the same register space as the HS Fork Phase output o# registers. The declarations of the HS Join Phase's o# registers must not overlap with any HS Fork Phase program's o# output declaration; the HS Join Phase is adding to the aggregate Patch Constant data output for the Hull Shader.
The declarations for inputs, outputs, temp registers, resources etc. in an HS Join Phase program function the same way as HS Fork Phase declarations(10.8.3).
Any individual HS Join Phase program can be declared to execute instanced, with a declaration identifying a fixed instance count from 1 to 128 (128 is the maximum number of scalar Patch Constant outputs). The HS Join Phase program executes the declared number of times per patch, with each instance identified by its 32-bit UINT input register vJoinInstanceID(23.9).
Note that if the role of an instanced Join Phase program is for each instance to produce a System Interpreted Value(4.4.5), say one of the inside TessFactors(10.10) for a quad patch per instance, the declarations for each of those outputs would identify the System Interpreted Value being produced, just like any other shader.
System Generated Values are dealt with the same(10.8.5) way in the HS Join Phase as in the HS Fork Phase. Instead of vForkInstanceID(23.8), in the Join Phase the same thing is called vJoinInstanceID(23.9). PrimitiveID(8.17) is available as a standalone input register.
Section Contents
(back to chapter)
10.10.1 Overview
10.10.2 Tri Patch TessFactors
10.10.3 Quad Patch TessFactors
10.10.4 Isoline TessFactors
Hull Shader(10) Fork and Join Phase code can declare up to 6 of their output scalars as System Interpreted Values that identify various Tessellation Factors, driving how much tessellation the fixed function Tessellator should perform. For example, on a Quad there are 4 TessFactors for the edges, as well as 2 for the inside. HLSL exposes alternative (helper) ways to generate the inside TessFactors automatically from the edge TessFactors, e.g. deriving them by min/max/avg on the edge values, and possibly scaling based on user-provided scale values. The hardware does not understand anything about this helper processing (it just appears as shader code).
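For illustration, one such helper reduction might look like the following sketch (this is compiler-generated-style shader logic expressed in C-style code; the averaging choice is just one of the options mentioned above):

```cpp
// Derive a quad patch's two inside TessFactors from its four edge
// TessFactors by averaging; min or max are alternative reductions.
// The hardware sees only the resulting values, not this logic.
void InsideFromEdgesAvg(const float edge[4], float inside[2])
{
    float avg = (edge[0] + edge[1] + edge[2] + edge[3]) * 0.25f;
    inside[0] = avg; // U direction
    inside[1] = avg; // V direction
}
```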
The optional (from the HLSL author point of view) tessellation factor processing results in HLSL compiler autogenerated shader code in either or both of the Fork and Join Phases. This standard processing can involve cleaning up of values, handling of special low TessFactor cases to prevent popping, and rounding of the values depending on the tessellation mode.
The final Tessellation Factors after this processing go to the fixed function Tessellator hardware – TessFactors for each edge and explicit TessFactors for the patch inside (as opposed to the user-specified TessFactorScale).
Downstream, Domain Shader(12) code may be interested in seeing all of the intermediate values generated during any optional TessFactor processing. For example, to be able to perform blending during Pow2 Partitioning tessellation, one might want to see the ratio between unrounded and rounded TessFactor values. To enable that, the auto-generated code in the Fork and/or Join Phases will output not only final TessFactor values for the tessellator, but also the intermediate values, so the Domain Shader can access them. There are at most 12 such additional values (in the case of a Quad Patch). Again, the hardware does not understand anything about these "helper" values, and they are not discussed in detail here.
The next sections describe just the TessFactors relevant to the hardware without discussing the various optional helper routines that HLSL provides to derive them.
Further information about how Tessellation Factors are interpreted is here(11.7.10).
float3 SV_TessFactor(24.8)
The first component provides the TessFactor for the U==0 edge of the patch.
The second component provides the TessFactor for the V==0 edge of the patch.
The third component provides the TessFactor for the W==0 edge of the patch.
The above hardware/system interpreted values must be declared in the same component of 3 consecutive registers (since indexing is on that axis).
float SV_InsideTessFactor(24.9)
This determines how much to tessellate the inside of the tri patch.
float4 SV_TessFactor(24.8)
The first component provides the TessFactor for the U==0 edge of the patch.
The second component provides the TessFactor for the V==0 edge of the patch.
The third component provides the TessFactor for the U==1 edge of the patch.
The fourth component provides the TessFactor for the V==1 edge of the patch.
The ordering of the edges is clockwise, starting from the U==0 edge (visualized as the "left" edge of the patch).
The above hardware/system interpreted values must be declared in the same component of 4 consecutive registers (since indexing is on that axis).
float2 SV_InsideTessFactor(24.9)
The first component determines how much to tessellate along the U direction of the inside of the patch.
The second component determines how much to tessellate along the V direction of the inside of the patch.
float2 SV_TessFactor(24.8)
The first component determines the line density (how many tessellated parallel lines to generate in the V direction over the patch area).
The second component determines the line detail (how finely tessellated each of the parallel lines is, in the U direction over the patch area).
The above hardware/system interpreted values must be declared in the same component of 2 consecutive registers (since indexing is on that axis).
IsoLines are discussed further here(11.6)
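To make the register-placement rules above concrete, here is a minimal HLSL sketch (illustrative, not normative; struct names are hypothetical) of how Hull Shader patch constant outputs declare these System Interpreted Values for each domain:

struct TriConstants
{
    float Edge[3]  : SV_TessFactor;        // U==0, V==0, W==0 edges
    float Inside   : SV_InsideTessFactor;  // single interior factor
};

struct QuadConstants
{
    float Edge[4]   : SV_TessFactor;       // U==0, V==0, U==1, V==1 edges (clockwise)
    float Inside[2] : SV_InsideTessFactor; // interior U factor, then interior V factor
};

struct IsoLineConstants
{
    float DensityDetail[2] : SV_TessFactor; // [0] = line density, [1] = line detail
};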
The Hull Shader output Patch Constant data appears as 32 vec4 elements. The placement of the Final TessFactors is constrained as described in the previous sections – each grouping of TessFactors must appear in a specific order in the same component of consecutive registers/elements in the Patch Constant Data. E.g. for Quad Patches, the four Final Edge TessFactors in a fixed order make up one grouping, and the two Final Inside TessFactors in a fixed order make up another separate grouping.
Shader indexing of the Patch Constant data across the 32 vec4 elements is restricted, due to the limitations of a particular hardware implementation, as follows:
Since the Hull Shader has multiple phases, each of which can be instanced (e.g. multiple Control Points in the Control Point phase, or instanced Fork or Join Phases), the "ret*" (return(22.7.16) or conditional return(22.7.17)) shader instruction is defined to end only the current instance of the current phase. So a "ret*" in the Control Point Phase would only finish the current Control Point invocation without affecting the others or other phases. Or a "ret*" in a Fork or Join Phase program would only end that instance of that program without affecting other instances (if it is instanced) or other Fork/Join programs.
The HS State Declaration Phase can optionally include a fixed float32 MaxTessFactor(22.3.20) in the range {1.0...64.0}.
This MaxTessFactor declaration(22.3.20) is useful when the application knows the maximum amount of tessellation it could possibly ask for through the TessFactor values it will output from the Hull Shader. Communicating this knowledge to the device allows it to optionally take advantage of it and perform better scheduling of resources on the GPU.
If a MaxTessFactor is declared, it is enforced by HLSL autogenerated TessFactor clamping code as the last step in the calculation of all of the following hardware System Interpreted Values (whose meanings were described earlier):
SV_TessFactor
SV_InsideTessFactor
For simplicity only a single MaxTessFactor value can be declared, and when it is present, it is applied to all the TessFactors listed above.
The device sees the MaxTessFactor declaration as a part of the Hull Shader. The knowledge of this declaration is what hardware can optionally take advantage of to optimize Tessellation performance for content going through that Hull Shader, versus an otherwise identical Hull Shader without the declaration.
If HLSL fails to enforce the MaxTessFactor when it is declared (by clamping the HS output TessFactors), and a TessFactor larger than MaxTessFactor arrives at the Tessellator, the Tessellator’s behavior is undefined. Hitting this undefined situation is a Microsoft HLSL compiler (or driver compiler) bug, not the fault of the shader author or hardware.
Note that independent of this optional application-defined MaxTessFactor, the Tessellator always performs some additional basic clamping and rounding of Final TessFactors as appropriate for the situation, described later (5.5). Those manipulations guarantee the hardware behavior by limiting the range of possible inputs. The only exception to that well defined hardware interface is this MaxTessFactor declaration, which must rely on HLSL to generate code to enforce it. The reason it is the responsibility of HLSL to enforce consistency in this one case is that it was too late in the spec process to arrive at any consistent hardware definition here, either by defining what the hardware behavior is if MaxTessFactor is not enforced but then exceeded at runtime, or by getting all hardware vendors to enforce the same MaxTessFactor clamping in hardware.
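For illustration, here is a minimal hs_5_0 sketch (hypothetical names, not spec code) showing where the MaxTessFactor declaration originates in HLSL; the [maxtessfactor()] attribute is what causes the compiler to emit the clamping code described above:

struct ControlPoint { float3 pos : POSITION; };

struct QuadConstants
{
    float Edge[4]   : SV_TessFactor;
    float Inside[2] : SV_InsideTessFactor;
};

QuadConstants QuadConstantsHS(InputPatch<ControlPoint, 4> patch)
{
    QuadConstants c;
    // Constant factors purely for illustration; real code would compute these.
    c.Edge[0] = c.Edge[1] = c.Edge[2] = c.Edge[3] = 8.0;
    c.Inside[0] = c.Inside[1] = 8.0;
    return c;
}

[domain("quad")]
[partitioning("fractional_even")]
[outputtopology("triangle_cw")]
[outputcontrolpoints(4)]
[patchconstantfunc("QuadConstantsHS")]
[maxtessfactor(16.0)] // HLSL emits clamp code; the declaration is also visible to the device
ControlPoint MainHS(InputPatch<ControlPoint, 4> patch,
                    uint i : SV_OutputControlPointID)
{
    return patch[i];
}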
Chapter Contents
(back to top)
11.1 Tessellation Introduction
11.2 Tessellation Pipeline
11.3 Input Assembler and Tessellation
11.4 Tessellation Stages
11.5 Fixed Function Tessellator
11.6 IsoLines
11.7 Tessellation Pattern
11.8 Enabling Tessellation
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
The tessellation model processes a patch at a time, either a quad, tri or "isoline" domain, and does not embody any specific surface representation. It strictly generates domain locations that are fed to a programmable shader (Domain Shader(12)) that is responsible for generating positions and any ancillary shading information (texture coordinates, tangent frames, normals, etc.). The domain locations are watertight across a boundary if an identical level of detail is used; otherwise the hardware plays no role in ensuring crack-free surfaces. This specification does not cover any specific surface representation, or how to map representations to the given pipeline.
Requirements
See the D3D pipeline(2) diagram to see how Tessellation (Hull Shader(10), Tessellator(11) and Domain Shader(12)) fits in.
The Input Assembler(8) has a new primitive topology called "patch list", which is accompanied by a vertex count per patch: [1..32]. This is also described under Patch Topologies(8.11).
All existing IA behaviors work orthogonally with patches. i.e. indexing, instancing, DrawAuto etc.
Incomplete patches are discarded – for example if the vertex count is 32 per patch, and a Draw call specifies 63 vertices, one 32 vertex patch will be produced, and the remaining 31 vertices will be discarded.
Here are pointers to the stages involved in Tessellation, in the order of data flow:
Fixed Function Tessellator(11.5) (this chapter, below)
This fixed function stage takes floating point TessFactor values as input and generates a tessellation of the domain. The domain can be tri, quad or isoLine (see next section for a definition of isoLines).
The tessellator generates two things per patch: the domain locations (U/V[/W] coordinates) of the tessellated points, and the topology connecting those points (independent points, lines or triangles).
Note the domains are defined such that for isoLines and quads, the V direction is clockwise from the U direction. For tri domain, UVW are clockwise, in that order.
Adjacency(8.15) information is not available when using the tessellator - only independent points, lines or triangles are generated. The order that points/lines/triangles and their vertices are produced must be invariant between similar tessellator invocations on the same device, but no explicit order is prescribed.
The isoLine domain is a specialized form of the quad domain. It is the only domain that can produce tessellated lines. For isoLines, the U direction over a quad domain is the direction tessellated lines are drawn (lines of constant V). There are two TessFactor(10.10.4) values:
The first is the line density, which is always rounded to integer and determines how many U-parallel tessellated line segments to generate across the V direction. The spacing of these line segments across V is uniform, starting at V=0. So if the line density is 1, a single tessellated line is generated from (U=0,V=0) to (U=1,V=0). If the line density is 2, the first tessellated line is generated from (0,0) to (1,0) and the second tessellated line is generated from (0,0.5)-(1,0.5). Notice that no line is ever generated at V=1.
The second TessFactor is the line detail, determining how much to tessellate each line of constant V.
For more concrete info on the tessellation pattern for isolines see IsoLine Pattern Details(11.7.8).
Section Contents
(back to chapter)
11.7.1 Overview
11.7.2 Tessellation Pattern Overview
11.7.3 Fractional Partitioning
Details of the point placement and connectivity are described in words in this section.
A more concrete description can be found in the reference fixed function tessellator code, entirely encapsulated in the following C++ files:
The inside of a triangle/quad patch is a tessellated triangle/square based on a specified InsideTessFactor(s). For a triangle, there is a single TessFactor(10.10.2) for the inside region of the patch. For a quadrilateral, there are 2 inside TessFactors(10.10.3).
HLSL exposes helpers that can optionally derive inside TessFactors from the edge TessFactors (these amount to shader code, so the hardware doesn't need to know about them). For example in the case of a quad patch, the helpers have a couple of options for deriving inside TessFactors – 1-axis and 2-axis. In the 1-axis mode, the inside TessFactor reduction is applied on all 4 edges producing a single inside TessFactor. In the 2-axis mode, the reduction from 4 edge TessFactors is divided into two separate parts. The V==0 and V==1 edge TessFactors are reduced to a single TessFactor for the V direction of the interior. Similarly the U==0 and U==1 TessFactors are reduced to a single TessFactor for the U direction on the interior.
The boundaries of the patch transition between the inside TessFactor(s) and each per-edge TessFactor.
There are two basic flavors of fractional tessellation: using either an even number of segments (intervals) on an edge or an odd number. When using an even number of segments, the coarsest refinement an edge can have is two segments, so it is impossible to model a level of detail with a single segment.
For integer partitioning, TessFactors are rounded to integer. The parity (even/odd) of each edge and inside TessFactor after rounding determines how that area is tessellated: an odd integer TessFactor matches odd fractional tessellation at the same TessFactor. Similarly, an even integer TessFactor matches even fractional tessellation at the same TessFactor.
For pow2 partitioning, TessFactors are rounded to a power of 2, and tessellation of pow2 TessFactors matches even fractional tessellation at the same TessFactor, but in addition the power of 2 mode can go down to 1 segment on any side (1 is a power of 2). From the hardware point of view there is no distinction between pow2 and integer - the hardware doesn't do the rounding of the TessFactors to pow2. That rounding is the responsibility of the HLSL compiler, given the shader being authored using the appropriate helper intrinsics in shader code (not discussed here).
Mapping Vertices to Texels 1:1 in an Application
Tri vs Quad Density Comparison
Example: Displacement Mapping
The order that geometry is generated for a patch must be repeatable on a device, however no particular ordering of the geometry within a patch is prescribed. A strict requirement is that all geometry for a given patch flows down the pipeline before any geometry for subsequent patches.
Suppose the rasterizer is the next active stage in the pipeline after tessellation, and there are vertex attributes that are declared in the Pixel Shader with constant interpolation. The leading vertex, used to provide the constant attribute for any individual line or triangle, can be any of the vertices in the line or triangle (albeit repeatable for a given patch and tessellator configuration on a device).
When a patch topology is used, PrimitiveID(8.17) identifies which patch in the Draw*() call is being processed, starting from the Hull Shader onward. Even though tessellation may produce multiple points/lines/triangles, for a given patch, all of the primitives generated for it have the same PrimitiveID. As such, the freedom of point/line/triangle ordering within a patch is not visible to shader code. When a patch topology is used, the true "primitive" is the patch itself.
The TessFactor number space roughly corresponds to how many line segments there are on the corresponding edge. This isn’t a precise definition of the number of segments because different tessellation modes snap to different numbers of segments (i.e. integer versus fractional_even versus fractional_odd).
For integer partitioning, TessFactor range is [1 ... 64] (fractions rounded up).
For pow2 partitioning, TessFactor range is [1,2,4,8,16,32,64]. Anything outside or in between values in this set is rounded to the next entry in the set by HLSL code... so from the hardware point of view, pow2 partitioning technically isn't different from integer partitioning.
For fractional odd partitioning, TessFactor range is [1 ... 63]. Odd TessFactors produce uniform partitioning of the space. Other TessFactors in the range produce a segment count that is the next odd TessFactor higher, transitioning the point locations based on the distance between the nearest lower odd TessFactor and nearest greater odd TessFactor.
For fractional even tessellation, TessFactor range is [2 ... 64]. Even TessFactors produce uniform partitioning of the space. Other TessFactors in the range produce a segment count that is the next even TessFactor higher, transitioning the point locations based on the distance between the nearest lower even TessFactor and nearest greater even TessFactor.
For the IsoLine domain, the line detail TessFactor honors all the above modes. However the line density TessFactor always behaves as integer – [1 ... 64] (fractions rounded up to the next integer).
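As a rough illustration of the ranges and rounding above, the following hypothetical HLSL-style helper (a sketch, not spec or hardware code; the mode encoding is invented) maps a raw TessFactor to the resulting segment count in each mode:

#define PART_INTEGER   0
#define PART_POW2      1
#define PART_FRAC_ODD  2
#define PART_FRAC_EVEN 3

float SegmentCount(float tf, uint mode)
{
    switch (mode)
    {
    case PART_INTEGER:   // [1...64], fractions rounded up
        return ceil(clamp(tf, 1.0, 64.0));
    case PART_POW2:      // {1,2,4,8,16,32,64}, rounded up to the next power of two
        return exp2(ceil(log2(clamp(tf, 1.0, 64.0))));
    case PART_FRAC_ODD:  // [1...63], segment count is the next odd integer >= tf
    {
        float t = clamp(tf, 1.0, 63.0);
        return 2.0 * ceil((t - 1.0) / 2.0) + 1.0;
    }
    default:             // PART_FRAC_EVEN: [2...64], next even integer >= tf
    {
        float t = clamp(tf, 2.0, 64.0);
        return 2.0 * ceil(t / 2.0);
    }
    }
}
// E.g. tf = 3.2 yields 4 (integer), 4 (pow2), 5 (fractional_odd), 4 (fractional_even);
// in the fractional modes, point locations additionally transition smoothly as tf varies.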
This particular clamp on TessFactors is discussed here(10.13), and is independent of the hardware clamps defined in the rest of this section.
The following describes the float32 patch edge TessFactor range that the hardware Tessellator must accept from the Hull Shader.
First of all, if any edge TessFactor is <= 0 or NaN, the patch is culled.
Otherwise, hardware must clamp each edge input TessFactor to the range specified below.
Partitioning | Min Edge TessFactor | Max Edge TessFactor | Comments |
---|---|---|---|
Even_Fractional | 2 | 64 | |
Odd_Fractional | 1 | 63 | |
Integer (Pow2 maps to integer in hardware) | 1 | 64 | After clamping, round result to next integer. |
For IsoLines, the LineDensity Tessfactor (which is how many constant V iso-lines to draw) is clamped by the hardware to [1...64] and rounded to the next integer.
In addition to patch edge TessFactors, hardware will be given inside TessFactors from the Hull Shader. There are two inside TessFactors for quad patches (U and V axes), and one inside TessFactor for tri patches.
These HS outputs may have been derived (optionally) from the edge TessFactors via some operation such as max or avg in Hull Shader code autogenerated by HLSL. This derivation may involve low TessFactor fixups to prevent popping as TessFactors transition through extreme cases. Such processing is just shader code, irrelevant to the hardware.
For the final inside TessFactors coming out of the Hull Shader, the following pseudocode describes the validation the hardware must perform, effectively creating safe bounds on the complexity of cases the hardware tessellation algorithm has to handle.
// Compute HWInsideTessFactorU/V for quad patch (similar tri patch case has only one axis),
// given HSOutputInsideTessFactorU/V + 4 edge TessFactors.
// This is just the fixed function hardware processing, independent of shader pre-conditioning
// of the TessFactors (which the hardware does not need to know about).

float lowerBound, upperBound;
switch( partitioning )
{
case integer:
case pow2: // don't care about pow2 distinction for validation, just treat as integer
    lowerBound = 1;
    upperBound = 64;
    break;
case even_fractional:
    lowerBound = 2;
    upperBound = 64;
    break;
case odd_fractional:
    #define EPSILON 0.0000152587890625 // 2^(-16), min positive fixed point fraction
    if( any TessFactor, edge or inside, is greater than (1.0 + EPSILON/2) )
    {
        // If any TessFactor will be > 1 after rounding during
        // the float to fixed point conversion that happens later,
        // then make all inside TessFactors > 1.
        lowerBound = 1.0 + EPSILON;
    }
    else // all are <= 1.0f or NaN
    {
        lowerBound = 1;
    }
    upperBound = 63;
    break;
}

HWInsideTessFactorU = min( upperBound, max( lowerBound, HSOutputInsideTessFactorU ) );
HWInsideTessFactorV = min( upperBound, max( lowerBound, HSOutputInsideTessFactorV ) );
// A tri patch only has one insideTessFactor instead of U/V.
// Note the above clamps map NaN to lowerBound based on D3D/IEEE754R min/max rules.

if( integer or pow2 partitioning )
{
    round HWInsideTessFactorU to next integer // (don't care about pow2 distinction for validation)
    round HWInsideTessFactorV to next integer // tri patch only has one insideTessFactor instead of U/V
}

// After this, all TessFactors are converted to .16 fixed point using D3D float->fixed
// conversion rules(3.2.4.1) (incl round-to-nearest-even). Topology and domain coordinate placement
// is done based on the fixed point TessFactors.
If any of the edge TessFactors from the HS for a patch are <= 0 or NaN, the patch is culled. No Domain Shader invocations or anything later in the pipeline are produced for that patch.
A discussion elsewhere about enabling and disabling(11.8) of tessellation discusses how patch culling interacts with tessellation disabled, but patches being streamed out to memory.
A shared edge has to generate identical domain locations for crack free tessellation to be possible. Domain Shader authors are responsible for achieving this, given some guarantees from the hardware. First, hardware tessellation on any given edge must always produce a distribution of domain points symmetric about the edge based on the TessFactor for that edge alone. Second, the parameterization of each domain point (U/V for quad or U/V/W for tri) must produce “clean” values in the space [0.0 ... 1.0]. “Clean” means that given a domain point on one side of the edge, with the parameter for that edge (say it is U) in [0 ... 0.5], the mirrored domain point produced on the other side, call it U' in [0.5 ... 1.0] will have a complement satisfying (1-U') == U exactly.
Even if a neighboring patch sharing an edge happens to produce a complementary parameterization (U moving in the other direction, and/or U/V swapped), both sides’ parameterizations for each shared edge domain point will be equivalent because they are clean.
Having clean parameterization means that DS authors can write domain point evaluation algorithms with a carefully constructed order of operations that is guaranteed to produce the same result even if the control points for the patch are traversed in reverse order and/or with the parameter space complemented.
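To illustrate what such a carefully constructed order of operations might look like, here is a hypothetical Domain Shader helper (a sketch, not spec code) that evaluates a cubic Bezier edge symmetrically. Because IEEE float addition is commutative and 1-u is exact for the "clean" fixed-point-derived u values, evaluating with the control points reversed and the parameter complemented produces a bit-identical result:

float3 EvalBezierEdge(float3 p0, float3 p1, float3 p2, float3 p3, float u)
{
    float v = 1.0 - u; // exact for "clean" domain values
    // Pair terms symmetrically in (u,v) so that swapping (p0<->p3, p1<->p2, u<->v)
    // performs the identical float operations, just with the addends exchanged.
    float3 ends   = (v * v * v) * p0 + (u * u * u) * p3;
    float3 middle = (3.0 * v * v * u) * p1 + (3.0 * u * u * v) * p2;
    return ends + middle;
}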
Tessellator input float32 TessFactor values are immediately converted to fixed point. Note this is after float processing of TessFactors, such as Inside TessFactor derivation has been done by HLSL generated shader code in HS Patch Constant Fork or Join Phases. Once the final TessFactors have been converted to fixed point, all remaining tessellator arithmetic (computing domain locations), is performed using fixed point arithmetic with 16 bits of fraction. The last step in domain point coordinate calculation is to convert the coordinates back to float32 for input to the Domain Shader.
The fact that output U/V/W domain coordinates(23.10) have been quantized to 16 bit fixed point means there is a uniform spacing of representable values across the [0...1] range. This uniform spacing facilitates the symmetry and watertightness issues discussed above.
Due to the fixed point arithmetic involved, it is possible for the tessellator to produce degenerate lines or triangles, where each vertex has identical domain coordinates. This will not be visible if the primitives are sent to the rasterizer, because they will be culled. However, if the Geometry Shader and/or Stream Output are enabled, the degenerate primitives will appear, and it is the application’s responsibility to be robust to this. For example, Geometry Shader code could check for and discard degenerates if that turns out to be the only way to avoid the algorithm being used from falling over on the degenerate input.
If the Tessellator’s output primitive is points (as opposed to triangles or lines), only unique points within a patch are required to be generated. The one exception is points on the threshold of merging: if TessFactors were to incrementally decrease, such points may appear in the system as duplicated points (with the same U/V coordinates) in an implementation dependent way.
What does 16-bit fixed point math for the domain coordinate generation mean?
Suppose a single patch is drawn 64 meters wide.
There is enough precision to place points at roughly 1 mm resolution (64 m / 2^16 ≈ 0.98 mm).
Section Contents
(back to chapter)
11.8.1 Final D3D11 Definition for Enabling Tessellation
The presence of both a Hull Shader and Domain Shader enables tessellation. When a Hull Shader and Domain Shader are bound, the Input Assembler topology is required to be a patch type (otherwise behavior is undefined). If a Hull Shader is bound and no Domain Shader is bound, or vice versa, the behavior is undefined.
Patches can be used at the Input Assembler without tessellation (no Hull Shader or Domain Shader), as long as the Geometry Shader and/or Stream Output are being used.
When tessellation is disabled (no Hull Shader and no Domain Shader bound), patches arriving at the Geometry Shader cause the GS to be invoked once per patch. Each GS invocation sees all the Control Points of the patch as an array of input vertices.
Allowing the GS to be invoked with patches allows it to effectively input non-traditional topologies (beyond points, lines, triangles). E.g. to invoke the GS with a cube as its input primitive, one could send 8 Control Point patches.
The GS does not support output of patches. The output of the GS remains one of: point list, line strips or triangle strips.
Sending un-tessellated patches to NULL GS + Stream Output is valid. This enables, for example, Control Points that have gone through the Vertex Shader to be streamed out for multi-pass or reuse scenarios. Note, however, it is not possible for Hull Shader outputs to be streamed out (or go into the GS) - the presence of the Hull Shader requires a simultaneous Domain Shader and enables Tessellation, both of which consume the Hull Shader output entirely.
When un-tessellated patches arrive at Stream Output, each Control Point in the patch appears as a single vertex for Stream Output. This definition is similar to the way NULL GS + Stream Output behaves with traditional primitive topologies such as triangle lists. As with other primitive types, only complete patches get written out; if there is not enough room to store a complete patch, it is discarded.
It could have been defined that Control Points arriving at the rasterizer are interpreted as points and rasterized as such, but that would have required a RenderTarget-space projected "position" to be present in the control points, and the application would have to have wanted to draw them as points. This is an extremely unlikely scenario, not worth targeting. Therefore, if an un-Tessellated patch arrives at the Rasterizer, behavior is undefined and the debug runtime will call this out as an error.
Original Definition for Enabling Tessellation
The behaviors described so far in this section are the result of making cutbacks from the originally defined behavior. The cutbacks were made due to concerns over how the design was unfriendly to certain choices of D3D11 hardware implementations, resulting in among other issues unreasonable hardware and driver complexity.
The original behavior is documented below for the sake of history, formatted like this. It is a superset of the final behavior above, so a lot of the content appears the same. Briefly, the most interesting extra bit of functionality was being able to pass Hull Shader outputs to GS/StreamOutput without tessellation. Tessellation was enabled only by the presence of a Domain Shader (which then required a Hull Shader). Without a Domain Shader, tessellation was disabled, but the Hull Shader could still be present, outputting control points downstream.
Enabling Tessellation (this crossed out text is no longer representative of D3D11)
The presence of a Domain Shader enables tessellation. When a Domain Shader is bound, the Input Assembler topology is required to be a patch type, and a Hull Shader must also be bound, otherwise the behavior is undefined (debug error).
The absence of a Domain Shader disables tessellation. The Input Assembler topology is still allowed to be a patch type when tessellation is disabled. The following subsections describe what this means.
Sending Un-Tessellated Patches to the Geometry Shader
When tessellation is disabled, patches arriving at the Geometry Shader (with or without a Hull Shader Present) cause the GS to be invoked once per patch. Each GS invocation sees all the Control Points of the patch as an array of input vertices. Patch Constant data from the Hull Shader, such as Tessellation Factors, are not visible to the GS.
Allowing the GS to be invoked with patches allows it to effectively input non-traditional topologies (beyond points, lines, triangles). E.g. to invoke the GS with a cube as its input primitive, one could send 8 Control Point patches.
Sending Un-Tessellated Patches to Null GS + Stream Output
Sending Un-Tessellated Patches to NULL GS + Stream Output is valid. This enables, for example, Control Points that have gone through the Vertex Shader and/or Hull Shader to be streamed out for multi-pass or reuse scenarios.
Each Control Point in the patch appears as a single vertex for Stream Output. This definition is similar to the way NULL GS + Stream Output behaves with traditional primitive topologies such as triangle lists. As with other primitive types, only complete patches get written out; if there is not enough room to store a complete patch, it is discarded.
If the HS is active, that means the HS output Control Points can be streamed out. Without the HS active, the VS output for each Control Point in a patch can be streamed out.
Patch Constant data output by the Hull Shader, such as Tessellation Factors, are not available to Stream Output. As a workaround, an application that needs to stream out Patch Constant data could set up the tessellator to run, but then have the Domain Shader flag for discarding (such as assigning a bad vertex position) all but the first n domain points for the patch. The n domain points (where n is chosen to fit all the Patch Constant data across n vertices’ storage) would save out all the patch data from the Domain Shader. The GS/Stream Output could then send the data to memory as a sequence of individual points.
If the HS culls a patch (by specifying an edge Tessellation factor <= 0) when tessellation is disabled, the "cull" has no effect on Stream Output of the patch. This choice was made because it is deemed not worth defining that the Stream Output stage must be able to interpret some Patch Constant data (TessFactors) to make a decision about what to stream out. Thus if un-tessellated patches are being sent to Stream Output, there is no way to cull them.
Sending Un-Tessellated Patches to the Rasterizer
It could have been defined that control points arriving at the rasterizer are interpreted as points and rasterized as such, but that would have required a RenderTarget-space projected "position" to be present in the control points, and the application would have to have wanted to draw them as points. This is an extremely unlikely scenario, not worth targeting. Therefore, if an un-Tessellated patch arrives at the Rasterizer, behavior is undefined and the debug runtime will call this out as an error.
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
For a Tessellation overview, see the Tessellator(11) section.
The Domain Shader instruction set is listed here(22.1.5).
Inputs for this stage are the 2D or 3D domain location(23.10) generated by the tessellator(11) and all of the data generated by the Hull Shader(10). This latter data is visible to all domain points in a patch. In all other ways this shader is effectively analogous to a Vertex Shader(9).
The Domain Shader can see all the data output by both phases of the Hull Shader, as well as the domain location of a particular point. The Domain Shader is invoked for every domain location generated by the Tessellator.
The following registers are available in the ds_5_0 model.
Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL |
---|---|---|---|---|---|---|
32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | N | None | Y |
32-bit indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | Y | None | Y |
32-bit Input Control Points (vcp[vertex][element]) | 32, see (1) below | r | 4(component)*32(element)*32(vert) | Y | None | Y |
32-bit Input Patch Constants (vpc[element]) | 32, see (1) below | r | 4 | Y | None | Y |
32-bit input location in domain (vDomain(23.10).xy or vDomain(23.10).xyz) | 1 | r | 3 | N | n/a | Y |
32-bit UINT Input PrimitiveID (vPrim) | 1 | r | 1 | N | n/a | Y |
Element in an input resource (t#) | 128 | r | 1 | Y | None | Y |
Sampler (s#) | 16 | r | 1 | Y | None | Y |
ConstantBuffer reference (cb#[index]) | 15 | r | 4 | Y | None | Y |
Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | Y(contents) | None | Y |
Output Registers: | ||||||
32-bit output Vertex Data Element (o#) | 32 | w | 4 | Y | None | Y |
(1) The domain shader sees the Hull Shader outputs in 2 separate sets of registers. The vcp registers can see all of the Hull Shader’s output Control Points. The vpc registers can see all of the Hull Shader’s Patch Constant output data.
Since code for the Hull Shader Patch Constant Fork or Join Phases outputs TessFactors using names such as SV_TessFactor, the DS must match those declarations on the equivalent vpc input if it wishes to see those values.
InstanceID(8.18) and VertexID(8.16) can be input as long as the Hull Shader outputs these values (per Control Point).
The domain location is another System Generated Value, appearing in its own input register (vDomain(23.10)).
The final set of System Values are the various TessFactors produced by the Hull Shader, discussed elsewhere(10.10). These can be declared as inputs on the corresponding part of the input Patch Constant (vpc) registers.
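Putting the above together, here is a minimal ds_5_0 sketch (hypothetical names, not spec code): the patch constant struct re-declares the SV_TessFactor/SV_InsideTessFactor names to see those values on the vpc inputs, and the domain location arrives as SV_DomainLocation (vDomain):

struct ControlPoint { float3 pos : POSITION; };

struct QuadConstants
{
    float Edge[4]   : SV_TessFactor;        // must match the HS patch constant output
    float Inside[2] : SV_InsideTessFactor;
};

[domain("quad")]
float4 MainDS(QuadConstants pc,                      // vpc registers
              float2 uv : SV_DomainLocation,         // vDomain
              const OutputPatch<ControlPoint, 4> cp) // vcp registers
    : SV_Position
{
    // Bilinear evaluation of the quad patch from its 4 control points.
    float3 p = lerp(lerp(cp[0].pos, cp[1].pos, uv.x),
                    lerp(cp[2].pos, cp[3].pos, uv.x),
                    uv.y);
    return float4(p, 1.0);
}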
Chapter Contents
(back to top)
13.1 Geometry Shader Instruction Set
13.2 Geometry Shader Invocation and Inputs
13.3 Geometry Shader Output
13.4 Geometry Shader Output Data
13.5 Geometry Shader Output Streams
13.6 Geometry Shader Output Limitations
13.7 Partially Completed Primitives
13.8 Maintaining Order of Operations Geometry Shader Code
13.9 Registers
13.10 Geometry Shader Input Register Layout
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
The Geometry Shader instruction set is listed here(22.1.6).
When a Geometry Shader is active, it is invoked once for every primitive passed down or generated earlier in the Pipeline. Each invocation of the Geometry Shader sees as input the data for the invoking primitive, whether that is a single point, a single line, a single triangle, or the Control Points for a Patch (if a Patch arrives with Tessellation disabled). A triangle strip from earlier in the Pipeline would result in an invocation of the Geometry Shader for each individual triangle in the strip (as if the strip were expanded out into a triangle list). All the input data for each vertex in the individual primitive is available (i.e. 3 vertices for a triangle), plus adjacent vertex data if applicable/available. All vertex inputs/Element-layout/adjacency to be read must be declared, and this declaration must be compatible with the data being produced above in the Pipeline. Other inputs include textures, and also Primitive ID as a 32-bit scalar integer input.
An alternate method of invoking the Geometry Shader is via instancing. A GS Instancing declaration(22.3.7) specifies a (fixed) number of times for the GS to be invoked for each primitive. Each instance that executes is identified by a GS instance ID value [0...n-1], and the outputs of each GS instance are appended to the end of the outputs of the previous invocation (with an implicit cut of the topology between instances - see the description of cutting further below). The maximum instance count that can be declared is 32; for a full explanation of the constraints of GS instancing, see the description of the GS Instancing declaration(22.3.7).
Some background: The D3D10 Geometry Shader had a limit on the amount of vertex data that a single shader invocation could emit. The limit was 1024 scalars of data (fatter vertices mean fewer vertices can be emitted), and the shader program had to statically declare the maximum number of vertices it intended to output. It was desirable to relax this limit in some fashion.
Another limitation of the D3D10 Geometry Shader design was that the way the GS emits vertices is implicitly serial. E.g. a GS program that wants to project an input triangle onto 6 cube faces must project to each cube face and emit geometry for each face one at a time. It was desirable to have a way for a GS program to be authored to explicitly reveal to the hardware when the calculations producing different batches of geometry from the same GS program are independent of each other. This way, hardware can execute each batch of vertex generation in parallel, as sketched below.
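A minimal HLSL sketch of that cube-face scenario using GS instancing (hypothetical names; one instance per face, each free to execute independently):

struct GSOut
{
    float4 pos : SV_Position;
    uint   rt  : SV_RenderTargetArrayIndex;
};

cbuffer FaceTransforms { float4x4 FaceViewProj[6]; }; // hypothetical per-face matrices

[instance(6)]       // GS Instancing declaration: 6 invocations per input triangle
[maxvertexcount(3)]
void MainGS(triangle float4 worldPos[3] : POSITION,
            uint face : SV_GSInstanceID,              // identifies the instance [0...5]
            inout TriangleStream<GSOut> stream)
{
    GSOut v;
    v.rt = face;    // route this instance's triangle to render target slice 'face'
    [unroll]
    for (uint i = 0; i < 3; i++)
    {
        v.pos = mul(worldPos[i], FaceViewProj[face]);
        stream.Append(v);
    }
    // Outputs of each instance are appended, with an implicit cut between instances.
}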
The GSInvocations Pipeline Statistics counter(20.4.7) reports the number of primitives input to the GS multiplied by the instance count per primitive. That is, each "instance" counts as a GSInvocation.
The Geometry Shader outputs data one vertex at a time using the "emit"(22.8.3) command. The topology of these vertices is determined by a fixed declaration(22.3.8), choosing one of: pointlist, linestrip, or trianglestrip as the output for the GS. Strips can be restarted by using the "cut"(22.8.1) command, which ends the current strip at the last emitted vertex, so that the next emitted vertex begins a new strip. The "emitThenCut"(22.8.5) instruction both emits a vertex, and stops the current strip on this vertex, so that the next vertex that is emitted begins a new strip. For pointlist output, "cut" has no effect (including the "cut" part of "emitThenCut").
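In HLSL these instructions surface as stream-object methods; a minimal sketch (hypothetical names): Append() corresponds to "emit", RestartStrip() to "cut", and an Append() immediately followed by RestartStrip() corresponds to "emitThenCut":

struct V { float4 pos : SV_Position; };

[maxvertexcount(6)]
void MainGS(triangle V tri[3], inout TriangleStream<V> ts)
{
    // First strip: pass the input triangle through.
    ts.Append(tri[0]);  // emit
    ts.Append(tri[1]);  // emit
    ts.Append(tri[2]);  // emit
    ts.RestartStrip();  // cut: the next Append begins a new strip

    // Second strip: the same triangle nudged along +x (purely illustrative).
    [unroll]
    for (uint i = 0; i < 3; i++)
    {
        V v = tri[i];
        v.pos.x += 1.0;
        ts.Append(v);
    }
}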
The outputs of a given invocation of the Geometry Shader are independent of other invocations (though ordering(4.2) is respected). A Geometry Shader emitting triangle strips will start a new strip on every invocation. In addition, as mentioned above, an invocation of the Geometry Shader can produce multiple separate strips using "cut"s.
The Geometry Shader must declare the maximum number of vertices an invocation of the Shader will output. The total amount of data that a Geometry Shader invocation can produce is 1024 32-bit values. The calculation of the Stream Output record size with one or more streams is as follows: given that each stream declares its outputs in its own clean-slate view of the full output register set, the total output record size is the number of scalars in the union of all the stream declarations. This size multiplied by the max output vertex count must not exceed 1024. When Geometry Shader instancing is used, the Stream Output record size restriction applies to each instance individually.
With only a single output stream, the above rule matches D3D10.
The limit on Geometry Shader output is based on how many "emit" calls the Shader makes. The limit on Geometry Shader output is not affected in any way by the size of the output buffer(s) that are present or whether or not they have even been bound. Even if no output Buffers happen to be bound to a Stream and a vertex is output (and therefore dropped), it still counts against the limit.
Hardware must enforce the limit above by stopping writes if the Geometry Shader program continues after emitting the declared maximum number of vertices.
See the documentation of the GS maximum output vertex count declaration(22.3.5), as well as the GS Instancing declaration(22.3.7) for more details.
The o# registers to be written by the Geometry Shader must be declared (e.g. "dcl_output o[3].xyz"). The set of these declarations defines which registers are read when an "emit"(22.8.3) command is issued, defining a vertex. Therefore, all vertices emitted by the Geometry Shader have the same data layout.
When a Geometry Shader output is identified as a System Interpreted Value(4.4.5) (e.g. "renderTargetArrayIndex" or "position"), hardware looks at this data and performs some behavior dependent on the value, in addition to being able to pass the data itself to the next Shader stage for input. When such data output from the Geometry Shader has meaning to the hardware on a per-primitive basis (such as "renderTargetArrayIndex" or "ViewportArrayIndex"), rather than on a per-vertex basis (such as "clipDistance" or "position"), the per-primitive data is taken from the Leading Vertex(8.14) emitted for the primitive.
Each time an "emit"(22.8.3) or "emitThenCut"(22.8.5) is issued the contents of the declared Geometry Shader output registers are read to produce a vertex, and in addition the Geometry Shader outputs immediately become uninitialized. In other words, if any output data needs to be repeated for consecutive vertices, the Geometry Shader program must write the data over again to the output registers for each vertex.
The Geometry Shader outputs have a close relationship to the Stream Output Stage/functionality, described here(14.3).
STREAM: For the discussion here, let us define a stream as a sequence of writes of a structure of data out of a shader. A Geometry Shader can output up to 4 streams, each at different rates, with independent data going to each stream. The utility of this is in conjunction with Stream Output(14).
BUFFER: For the discussion in this section, in the context of Stream Output(14), a Buffer is a resource in memory that can receive any subset of the data from one stream. A stream can have its data split out (not replicated) across multiple buffers, and this mapping is defined by a Stream Output declaration (which is not visible in the Geometry Shader code). A Buffer cannot receive data from multiple streams at once.
Up to 4 streams can be declared(22.3.9) by the GS. Without the GS present, all vertex data is a single stream.
When the GS defines multiple streams, variants of the "emit"(22.8.3), cut(22.8.1) or "emitThenCut"(22.8.5) instructions which take an immediate stream # [0..4-1] parameter must be used by the GS to indicate which stream is being output. These instructions are "emit_stream"(22.8.4), cut_stream(22.8.2) and "emitThenCut_stream"(22.8.6), respectively.
From the point of view of the Geometry Shader, all the declarations of its output registers appear multiple times independently, once per stream. A statement appears in the bytecode setting the current output stream being declared, and subsequent declarations of output registers define what data gets latched when vertex data is emitted to each stream. The set of output registers available to the GS program during execution is the union of all output registers declared for each stream (individual streams can use the same output registers). When a vertex is emitted to a given stream, only the output registers declared for that stream feed the output to the stream; however, ALL declared output registers for all streams become uninitialized.
If output register indexing is declared(22.3.30), specifying a range of output registers that can be dynamically indexed, the register space that can be declared for indexing is the union of all stream output register declarations.
When outputting to multiple streams, the GS output topology declaration(22.3.8) must appear for each stream, and must be set to "point". In other words, multiple streams means that non-point output is unavailable.
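A minimal multi-stream sketch (hypothetical names, not spec code): in HLSL each additional inout stream parameter declares another output stream, streams are numbered in declaration order, and with more than one stream every stream must be a PointStream:

struct Shaded { float4 pos : SV_Position; float2 uvc : TEXCOORD0; };
struct Marker { float3 posToRevisit : TEXCOORD0; };

[maxvertexcount(2)]
void MainGS(point Shaded input[1],
            inout PointStream<Shaded> stream0,  // stream 0 (could be the rasterizer stream)
            inout PointStream<Marker> stream1)  // stream 1 (e.g. Stream Output only)
{
    stream0.Append(input[0]);                   // emit_stream to stream 0

    Marker m;
    m.posToRevisit = input[0].pos.xyz;
    stream1.Append(m);                          // emit_stream to stream 1
}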
The points-only limitation with multi-stream output was a hardware limitation during the design. Perhaps in future DX releases this can be relaxed - that is to allow arbitrary topologies in each stream. An example would be to output triangles to one stream that goes to the rasterizer, while sending points to another stream that goes to Stream Output at a different frequency for compiling a list of coordinates to revisit with some postprocessing later. Or to render some triangles while saving off rejected ones.
When outputting to only a single stream, the output from the GS can be a point list, line strip or triangle strip (strips are expanded to lists when streamed to memory), or a patch list. Output of a patch list from the GS is only valid for Stream Output, not for rasterization (undefined behavior).
When outputting to multiple streams, one of them can be sent to the rasterizer (independently of whether it is also streaming to memory). The Stream Output declaration specifies this (outside the shader code, but appearing to the driver side by side). Interpolation modes, System Interpreted Values and System Generated Values can be declared on any combination of Streams in the Shader, but the only ones that have any meaning are the ones corresponding to the Stream (if any) declared (outside the shader) as going to the rasterizer (if any). For Streams that are not going to the Rasterizer, the names are ignored. Notice that the same shader could be created with different Stream Output declarations, each time selecting a different Stream to go to Rasterization.
If a GS with streams is passed to CreateGeometryShader at the API/DDI (meaning there is no Stream Output declaration or rasterizer stream selection), the active stream defaults to 0. So stream 0 goes to rasterization if rasterization is enabled, and the absence of a Stream Output declaration means nothing is streamed out to memory. If the stream selected to go to rasterization isn’t declared in the GS or doesn’t include a position and rasterization is enabled, behavior is undefined, just as with any shader that feeds the rasterizer without a position.
Sending one of the streams to rasterization with multiple streams isn't a particularly interesting feature for now, since in the multi-stream case all streams are point lists.
Interpolation modes declared for the outputs on one Stream don’t have to match those on another Stream. Note that when the Geometry Shader is created, a choice of which stream (if any) is going to rasterization is made, so the driver shader compiler needs to pay attention to interpolation modes and System Interpreted Values (such as "position") on at most a single Stream’s declarations.
When the application knows that some GS outputs will be treated as per-primitive constants at the subsequent Pixel Shader, the Geometry Shader need only initialize such output registers when they represent the Leading Vertex(8.14) for a primitive. For example, on the last 2 vertices in a triangle strip, outputs that (on Leading Vertices) would have been treated as constant by the Pixel Shader need not be written. If Stream Output is being used, which has no knowledge of what data is per-primitive constant or not, in the expansion of GS output strips to lists, Stream Output simply dumps out all the declared outputs for each vertex for each primitive. If the GS chooses not to write out what it knows is non-Leading-Vertex data for Elements that will be used to drive per-primitive constants in a later pass, uninitialized data gets written to these unwritten Elements in Stream Output. This is fine as long as the application never attempts to later read such uninitialized Stream Output data. If the application later recirculates the Streamed Out data in a way that correctly interprets only per-primitive constant data at Leading Vertices and never interprets the uninitialized data at non-Leading-Vertices (even though it does get read back into the pipeline), no undefined behavior results.
There is a mechanism to retrieve the number of output primitives in the output buffer. Further details regarding writing to memory from the Geometry Shader are described elsewhere in the spec.(14)
Partially completed primitives could be generated by the Geometry Shader if the Geometry Shader ends while a primitive is incomplete. Incomplete primitives are silently discarded and no counters are incremented. This is similar to the way the IA treats Partially Completed Primitives(8.13).
To ensure consistent order of operations on an edge and primitive level for primitives that show up in multiple invocations of the Geometry Shader (as an adjacent primitive in some invocations, or the root primitive for one invocation), it is up to the application to write Shader code that traverses vertices in a consistent manner. This ordering can be obtained by a variety of methods, including simply sorting of vertices based on position in Shader code. A more robust ordering can be achieved by providing a vertex "coloring" (a number) as vertex attribute, such that for any primitive, the coloring is guaranteed to be unique for each vertex in the primitive. This method has the benefit that the sorting operation in the Geometry Shader is more efficient (and robust) than sorting xyz vertex positions. Colorings can be generated offline by an authoring tool.
The following registers are available in the gs_5_0 model:
Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL |
---|---|---|---|---|---|---|
32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | n | none | y |
32-bit Indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | y | none | y |
32-bit Input (v[vertex][element]) | 32 | r | 4(comp)*32(vert) | y | none | y |
32-bit Input Primitive ID (vPrim) | 1 | r | 1 | n | none | y |
32-bit Input Instance ID (vInstanceID) | 1 | r | 1 | n | none | y |
Element in an input resource (t#) | 128 | r | 1 | n | none | y |
Sampler (s#) | 16 | r | 1 | n | none | y |
ConstantBuffer reference (cb#[index]) | 15 | r | 4 | y(contents) | none | y |
Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | y(contents) | none | y |
Output Registers: | ||||||
NULL (discard result, useful for ops with multiple results) | n/a | w | n/a | n/a | n/a | n |
32-bit output Vertex Data Element (o#) | 32 | w | 4 | y | none | y
The Geometry Shader must declare which type of primitive it expects as input, out of the set of choices: {point,line,triangle,line_adj,triangle_adj,1-32 control point patch list}. The input primitive type specifies the number of vertices that are present, and the vertices are always fully indexed (there is no declaration for vertex indexing range). Even if strips are being used earlier in the Pipeline, individual primitives cause Geometry Shader Invocations. See the GS Input Primitive Declaration Statement(22.3.6) in the instruction reference.
The following diagrams depict the layout of Geometry Shader Input Primitives into the input v# registers:
Chapter Contents
(back to top)
14.1 Mapping Streams to Buffers
14.2 Stream Output Buffer Declarations/Bindings
14.3 Stream Output Declaration Details
14.4 Current Stream Output Location
14.5 Tracking Amount of Data Streamed Out
14.6 Stream Output Buffer Bind Rules
14.7 Stream Output Is Orthogonal to Rasterization
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
The Pipeline can stream vertices out to memory just before clipping and rasterization (even if rasterization is still enabled). Vertices are always written out as complete primitives (e.g. 3 vertices at a time for triangles); incomplete primitives are never written out.
Just before Streaming Out, all topologies are always expanded to lists (i.e. if the topology is a triangle strip, it is expanded to a triangle list, having 3 vertices per primitive).
If the Geometry Shader is active, it is capable of producing outputs with up to 32 Elements per-vertex (each Element up to 4 components) for the Rasterizer, any subset of which can be routed to Stream Output. The presence of the GS allows multiple streams to be generated as well, as described here(13.5).
If the Geometry Shader is not active, whatever data arrives at the point in the pipeline where Stream Output appears (just before clipping and rasterization) can be Streamed Out (after expansion to a list topology as described above). Topologies with adjacency discard the "adjacent" vertices and only Stream Out the "interior" vertices. Patch topologies arriving at Stream Output can only go to Stream Output; the rasterizer must be disabled (undefined behavior otherwise).
In the expansion of strips to lists of primitives on Stream Output from the Geometry Shader, there is no notion of any data being able to be treated as "constant"; for each Geometry Shader output primitive (after expansion from a strip to a list), the vertices each originate from separate "emit"(22.8.3) instructions. Applications can still take advantage of this behavior to store primitive data, simply by relying on the fact that if streamed out geometry is recirculated back into the Pipeline in another pass, the Rasterizer will treat the Leading Vertex(8.14) in each primitive as the source for attributes that are declared as constant by the Pixel Shader.
A description of the distinction between a Stream and Buffer is given here(13.5). Up to 4 Streams can be present when the GS is used, otherwise there is a single Stream, Stream 0.
Stream Output can send data from any Stream to up to 4 Buffers simultaneously. The total number of output Buffers across all Streams is also constrained to 4. Data from multiple Streams cannot go to a single Buffer, but each Stream can send its output to multiple Buffers. Stream data cannot be replicated across multiple buffers.
Up to 128 scalar components of data per-vertex can be streamed out across the output Buffers, as long as the total window of data being output per-vertex to any one Buffer is 512 bytes or less. Vertex stride to a given Buffer can be up to 2048 bytes.
The mapping of data from Streams to where they are written in output Buffers appears in a declaration outlined further below.
In all cases, the only supported output data formats at Stream Output are 32-bit per component integer and floating point formats, with 1 to 4 components. This is not as general as the other Resource input/output paths in the D3D11.3 Pipeline. See the "Stream Output" column in the formats(19.1) table to see which formats can be used for Stream Output (all of which can of course be used at other parts of the D3D11.3 Pipeline for input). When any given 32-bit component of data in the Pipeline goes out the Stream Output path and gets written to memory, the hardware must simply dump the 32 bits (per component) of data out unaltered, which is consistent with the sorts of formats supported for Stream Output described here.
The selection of which Elements to send to the Stream Output is tied to the Geometry Shader. When a Geometry Shader program is "Created" on the D3D11.3 Device, additional parameters can be passed into the "Create" call alongside the Geometry Shader code, describing: (a) what subset of data from the GS output to send to Stream Output for each of 1 to 4 Streams, (b) where to write the data to memory, and (c) selection of 0 or 1 of the output Streams as going to the Rasterizer (independent of whether it is going to Stream Output as well). If the Geometry Shader is not needed, but Stream Output functionality is desired, a "NULL" GS program can be specified, along with a Stream Output declaration for Stream 0 only, in which case whatever geometry reaches the GS stage of the pipeline gets Streamed Out.
The vertices in one Stream reaching the point in the pipeline just before the Rasterizer/clipping can be sent both to the Rasterizer (if the Pixel Shader is active) as well as to Stream Output if it is active, simultaneously. The Pixel Shader can consume any subset of the data reaching it, while Stream Output can simultaneously select any other (possibly overlapping) subset of the data.
The "NULL" GS + Stream Output scenario enables operations such as Streaming out the results of a VS. An application might wish to apply skinning to a vertex Buffer and save the results for reuse multiple times later. This may be accomplished by configuring a pipeline with a VS and a NULL GS (which just describes Stream Output). The vertex Buffer can be traversed by drawing a pointlist, in which case the VS will be invoked once for each vertex where skinning would be done, and then the Stream Output description can dump the result out to memory.
The CreateGeometryShaderWithStreamOutput() DDI is defined roughly as follows (exact details will vary; IHVs should defer to the reference codebase). The API differs in a few ways from this DDI, such as hiding the concept of "registers" and "masks" appearing below, instead using string names for elements in a shader output signature, and component counts / offsets to identify data within elements.
typedef struct D3D11DDIARG_CREATEGEOMETRYSHADERWITHSTREAMOUTPUT
{
    CONST DWORD*                            pShaderCode;
    CONST D3D11DDIARG_STREAM_OUTPUT_STREAM* pStreams;
    UINT                                    NumStreams;
    CONST UINT*                             pBufferStrideInBytes;
    UINT                                    NumStrides;
} D3D11DDIARG_CREATEGEOMETRYSHADERWITHSTREAMOUTPUT;

pShaderCode - The GS program. This can be NULL, which means there is no GS, but stream output is being defined (NumEntries must be > 0).

NumStreams - How many Streams are being defined [0...4]. When set to 0, Stream Output is not being used (pShaderCode MUST have a GS in this case). A nonzero value defines the size of the Stream declaration array, pStreams.

pBufferStrideInBytes - Array specifying, for each output Buffer, the spacing between the beginning of each vertex during stream output. The stride value must be >= the declared size of the stream output structure (including gaps), up to 2048 bytes max. Any amount in excess of the size of the stream output structure is untouched in memory during stream output.

NumStrides - How many Buffers are being defined [0...4].

typedef struct D3D11DDIARG_STREAM_OUTPUT_STREAM
{
    CONST D3D10DDIARG_STREAM_OUTPUT_DECLARATION_ENTRY* pOutputStreamDecl;
    UINT                                               NumEntries;
    BOOL                                               StreamToRasterizer;
} D3D11DDIARG_STREAM_OUTPUT_STREAM;

NumEntries - Indicates how many entries are in the array at pOutputStreamDecl. This must be > 0, and defines how many Elements (including gaps between Elements in memory that aren't touched) are being defined for Stream Output, per-vertex. Maximum count is 128 per Stream, with up to 4 Streams supported.

pOutputStreamDecl - Array of NumEntries instances of the structure defined below. This array defines a contiguous sequence of up to 128 32-bit components of memory to get written per-vertex during Stream Output. Each declaration entry defines up to 4 components that either (a) come from one GS output register, or (b) are skipped (gap in output). Consecutive declaration entries define output memory contiguous to the previous entry.

StreamToRasterizer - Whether this Stream is going to the Rasterizer. Only one stream can have this set to true. It is valid for no stream to set this true. If a Stream is going to the Rasterizer, it can also be sent to Stream Output as well (which is what pOutputStreamDecl above defines, independently).

typedef struct D3D10DDIARG_SO_DECLARATION_ENTRY
{
    UINT OutputSlot;    // Which output buffer (slot) this is going out to.
                        // OutputSlot can only be [0..3].

    UINT RegisterIndex; // This specifies which GS register to take output from.
                        // The same register can appear multiple times in
                        // the declaration (and does not have to appear
                        // consecutively in the declaration), as long as the
                        // RegisterMask does not overlap for repeated registers
                        // within a Stream. Separate streams can overlap
                        // output registers and component masks freely.
                        // If there's no GS, RegisterIndex refers to the
                        // appropriate "register" from the previous active
                        // Pipeline Stage's output.
                        // There is no limit on the total number of unique
                        // registers that can be referenced (e.g. all 32 GS
                        // output registers can be referenced), as long
                        // as the amount of data doesn't exceed 128 32-bit
                        // values.
                        // A special RegisterIndex, 0xffffffff, represents
                        // a gap in stream output. In this case, no data
                        // from the pipeline is written out; instead the
                        // components specified by RegisterMask are skipped in
                        // the output (and the output memory is unchanged).
                        // The only valid RegisterMask values for gaps are
                        // .x, .xy, .xyz or .xyzw, representing
                        // gaps of 1, 2, 3 or 4 components, respectively.
                        // Larger gaps are defined by chaining together
                        // smaller gaps (at least at the DDI).

    DWORD RegisterMask; // Mask (i.e. xyzw mask) to apply to this "register"
                        // coming from the Pipeline. This must be a subset of
                        // the mask for the "register" in the source Pipeline
                        // Stage's output, and cannot have gaps between
                        // components. To define gaps between components,
                        // such as writing .xw, separate declaration
                        // entries are used, e.g. for .xw: an entry for
                        // .x, an entry for the gap, and an entry for .w.
                        //
                        // The width of the mask defines how far the
                        // Stream Output location advances. For example, if
                        // the mask is .yzw, then Stream Output writes 3
                        // 32-bit values (y, z, w).
                        // To accomplish complex layouts, such as swapping
                        // component order or interleaving components from
                        // multiple registers, and having gaps, multiple
                        // declaration entries are used (allowing
                        // Stream Output to be defined a component at a time).
                        //
                        // See RegisterIndex above for special behavior when
                        // the register is set to 0xffffffff (gaps).
                        //
                        // RegisterMask cannot be empty.
                        //
                        // ------
                        //
                        // Example scenario for RegisterMask:
                        // Suppose - RegisterIndex is 10, and
                        //         - the GS declares o10.yzw for output.
                        //
                        // In this case, RegisterMask would be allowed only to be
                        // the following, where (#) indicates how far in
                        // multiples of 32 bits the stream output location
                        // advances:
                        // .y (1), .z (1), .w (1), .yz (2), .zw (2), .yzw (3).
} D3D10DDIARG_SO_DECLARATION_ENTRY;
In order to use Stream Output, the application must:
Create the Geometry Shader with a Stream Output declaration (CreateGeometryShaderWithStreamOutput, described above).
Create Buffer(s) with the Stream Output Pipeline Bind Flag(5.3.4).
Bind the Buffer(s) to the Stream Output stage with starting offsets (described further below).
Below is a very rough example (using pseudocode) of the sequence of operations an application might perform, and how to calculate vertex counts.
What the Shader wants to do:
Suppose the GS needs to output:
float2 A
int4 B
float3 C
float3 D
The shader needs {A, B} to be output at one frequency as a point list.
{C, D} are to be output at another frequency as a point list.
A needs to go to buffer 0.
B needs to go to buffer 1.
A and B both need to go to the rasterizer as well.
C and D need to go to buffer 2.
The shader needs to output up to 100 of {A,B} and up to 70 of {C,D}, worst case 170 (100+70) emits total.
How this is accomplished by the application (basically by declaring exactly what is needed):
The Geometry Shader declares A and B into one stream (say stream 0), so emits of the data to stream 0 are done via emit(0). HLSL declares in the shader IL that A goes to o0.xy, B goes to o1.xyzw.
C and D are declared into another stream (stream 1), so emits to stream 1 are done via emit(1). HLSL declares in the shader IL that C goes to o0.xyz and D goes to o1.xyz.
The CreateGeometryShaderWithStreamOutput() call tags Stream 0 as going to the rasterizer.
Stream 0 and Stream 1 are declared as a point list topology (in fact whenever producing multiple streams, the only available topology is point list for each of them).
Vertices can be emitted to either stream in any order.
The shader code doesn't need to know anything about the mapping of A,B,C,D to buffers/formats/memory layout. As in D3D10, the buffer output declaration that accompanies the shader at CreateGeometryShaderWithStreamOutput() is responsible for those assignments and format definitions. This API validates stream constraints, such as enforcing that outputs declared in different streams in the shader cannot be sent to the same buffer. In contrast, what this example does is valid: parts of a single output stream are split across multiple buffers.
The GS output declaration declares the max output vertex count as 170. As a result, shader compilation fails for this example! The reason is that the output vertex record size, based on the output declarations for the 2 streams, is the union of the declarations of each. Since stream 0 defines o0.xy and o1.xyzw, and stream 1 defines o0.xyz and o1.xyz, the union is {o0.xyz,o1.xyzw} = 7 scalars. 7 * 170 vertices = 1190, which is greater than 1024. If it happened that stream 1 also declared o0.xy and o1.xyzw (same as stream 0), the record size would have been 6 scalars, and 6*170 = 1020 which would have been valid.
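For illustration, here is a hedged sketch of how this example's declaration might be expressed through the released D3D11 API (the semantic names "A"/"B"/"C"/"D" and the gsBytecode/gsLength variables are hypothetical placeholders; the DDI structures described above are what the driver ultimately receives):

// Sketch only: API-side stream output declaration for the {A,B} / {C,D} example.
D3D11_SO_DECLARATION_ENTRY soDecl[] =
{
    // Stream 0: A (float2, o0.xy) -> Buffer 0; B (int4, o1.xyzw) -> Buffer 1.
    { 0, "A", 0, 0, 2, 0 },
    { 0, "B", 0, 0, 4, 1 },
    // Stream 1: C (float3, o0.xyz) and D (float3, o1.xyz) -> Buffer 2.
    { 1, "C", 0, 0, 3, 2 },
    { 1, "D", 0, 0, 3, 2 },
};
UINT strides[] = { 2 * 4, 4 * 4, 6 * 4 }; // per-Buffer vertex strides in bytes

ID3D11GeometryShader* pGS = nullptr;
HRESULT hr = pDevice->CreateGeometryShaderWithStreamOutput(
    gsBytecode, gsLength,            // compiled GS (hypothetical variables)
    soDecl, ARRAYSIZE(soDecl),
    strides, ARRAYSIZE(strides),
    0,                               // Stream 0 goes to the Rasterizer
    nullptr, &pGS);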
Buffers used for Stream Output need a way to keep track of how full they are, in order to support appending and, potentially, to be able to invoke DrawAuto(8.9) without the CPU knowing how full the Buffer is at that time. See the Stream Output Pipeline Bind Flag for Buffers(5.3.4). This value is referred to as the BufferFilledSize. When the Buffer is newly created, the BufferFilledSize must equal 0.
In addition to the structure definition (or type declaration for a single-Element Buffer), there is a mechanism for defining the starting offset into the Buffers where Shader outputs will begin to be written. This offset is equivalent to the BufferFilledSize associated with each Stream Output Buffer, since defining the starting offset also redefines the BufferFilledSize value. Subsequent Draw() calls stream output data to the Buffer starting at the offset, effectively appending data to the Buffer and accumulating the BufferFilledSize value associated with the Buffer. Each Draw() call continues appending at the location where the previous Draw() call finished, as if the starting offset were implicitly moved forward at the end of each Draw() call. The starting offset can also simply be reset to any location in the Buffer, overriding the implicit advancement after Draw() calls and redefining the BufferFilledSize. When setting the Stream Output Buffer and starting Buffer offset, a reserved value for the starting Buffer offset (e.g. -1) indicates that the Buffer's current BufferFilledSize should be used as the starting offset. This allows a Stream Output Buffer to be appended to even if the Buffer is unbound from the Pipeline and bound back again later. So, these two call patterns would be identical:
SetStreamOutput( pBuffer, 0 );  // Buffer & starting offset.
Draw();                         // appends Stream Output & increases pBuffer's BufferFilledSize.
Draw();                         // appends Stream Output & increases pBuffer's BufferFilledSize.
Draw();                         // appends Stream Output & increases pBuffer's BufferFilledSize.

SetStreamOutput( pBuffer, 0 );  // Buffer & starting offset.
Draw();                         // appends Stream Output & increases pBuffer's BufferFilledSize.
SetStreamOutput( pBuffer, -1 ); // starting offset = pBuffer's BufferFilledSize.
Draw();                         // appends Stream Output & increases pBuffer's BufferFilledSize.
SetStreamOutput( pBuffer, -1 ); // starting offset = pBuffer's BufferFilledSize.
Draw();                         // appends Stream Output & increases pBuffer's BufferFilledSize.
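At the released D3D11 API level, the equivalent pattern uses ID3D11DeviceContext::SOSetTargets, where a starting offset of -1 requests the append-to-BufferFilledSize behavior. A minimal sketch (pSOBuffer and vertexCount are hypothetical):

ID3D11Buffer* targets[1] = { pSOBuffer };  // created with D3D11_BIND_STREAM_OUTPUT
UINT offsets[1] = { 0 };                   // start writing at byte offset 0
pContext->SOSetTargets(1, targets, offsets);
pContext->Draw(vertexCount, 0);            // appends; BufferFilledSize grows

offsets[0] = (UINT)-1;                     // -1: resume at the Buffer's BufferFilledSize
pContext->SOSetTargets(1, targets, offsets);
pContext->Draw(vertexCount, 0);            // keeps appending after rebinding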
In order to monitor how much data the Pipeline has streamed out, there are two asynchronous queries: SO_STATISTICS(20.4.9) and SO_OVERFLOW_PREDICATE(20.4.10).
In short, SO_STATISTICS provides a mechanism to retrieve values from two hardware counters for each Stream:
(a) UINT64 NumPrimitivesWritten = the number of primitives written to a Stream
(b) UINT64 PrimitiveStorageNeeded = the total number of primitives that would have been written given sufficient storage for the Buffer(s) in a Stream.
The raw values of hardware counters are rarely useful on their own, so the common usage of these counters is to sample them twice and subtract one sample from the other. The NumPrimitivesWritten difference and the PrimitiveStorageNeeded difference will not be equal if the Draw() call(s) invoked between the two sample points generate more primitives than there is space left in the smallest of the currently bound Buffer(s) to store them. Note there is only one NumPrimitivesWritten counter per Stream even though it is possible to have multiple simultaneous Buffers bound for writing by a Stream. Stream Output is defined to stop all writes to a Stream if one of the Buffers being written by that Stream does not have room for another complete primitive.
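For illustration, a hedged sketch of this two-sample usage via the released D3D11 query API, where the Begin()/End() bracket corresponds to the two counter sample points described above (pDevice, pContext and vertexCount are assumed to exist):

D3D11_QUERY_DESC qd = {};
qd.Query = D3D11_QUERY_SO_STATISTICS;      // Stream 0; *_STREAM0..3 variants also exist
ID3D11Query* pQuery = nullptr;
pDevice->CreateQuery(&qd, &pQuery);

pContext->Begin(pQuery);                   // first counter sample point
pContext->Draw(vertexCount, 0);            // stream output happens here
pContext->End(pQuery);                     // second counter sample point

D3D11_QUERY_DATA_SO_STATISTICS stats = {};
while (pContext->GetData(pQuery, &stats, sizeof(stats), 0) == S_FALSE)
    ; // spin until the GPU result is available
BOOL overflowed = (stats.NumPrimitivesWritten != stats.PrimitivesStorageNeeded);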
The hardware always writes as many complete primitives (e.g. 3 vertices for a triangle) as possible to the Buffer(s) for a Stream; a given primitive is written only if there is enough space for its entire contents (e.g. 3 times the vertex stride for triangles must be available in the Buffer) in all the output Buffers for the Stream. If any Buffer for a Stream becomes full before the Draw() call has completed (i.e. no more space for a complete primitive to be appended), Shader execution continues, along with sustained incrementing of the PrimitiveStorageNeeded counter for that Stream, but not the NumPrimitivesWritten counter for that Stream. In addition, the Shader's outputs are no longer written to any of the output Buffers for that Stream. Output to other Streams functions independently.
An application can detect the overflow condition with the SO_OVERFLOW_PREDICATE(20.4.10). In particular, there are 4 + 1 predicates: one for each Stream, plus an additional predicate that indicates whether any of the 4 Streams has overflowed. These predicates can be used to mask future graphics commands, for example to prevent a corrupted frame from being displayed. This can be useful when streaming unpredictable amounts of data out from the Geometry Shader.
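A hedged sketch of this masking via the released D3D11 API follows (variable names are illustrative). Per the SetPredication contract, predicated commands are not performed when the predicate's result equals the PredicateValue passed in:

D3D11_QUERY_DESC pd = {};
pd.Query = D3D11_QUERY_SO_OVERFLOW_PREDICATE;  // "any Stream" variant;
                                               // *_STREAM0..3 variants also exist
ID3D11Predicate* pPredicate = nullptr;
pDevice->CreatePredicate(&pd, &pPredicate);

pContext->Begin(pPredicate);
pContext->Draw(vertexCount, 0);                // stream output that might overflow
pContext->End(pPredicate);

// Skip the next commands when the predicate result equals TRUE
// (i.e. an overflow occurred between Begin and End).
pContext->SetPredication(pPredicate, TRUE);
pContext->Draw(presentVertexCount, 0);         // e.g. drawing the possibly corrupt result
pContext->SetPredication(nullptr, FALSE);      // disable predication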
If multiple Buffers are being written by a given Stream, as soon as one of the Buffers can no longer hold any more complete primitives, writes to ALL Buffers for that Stream are stopped, while as mentioned above, Shader execution continues, and the PrimitiveStorageNeeded counter continues to tally for that Stream. Other Streams operate independently.
If an output buffer slot (0..3) has data streamed out to it (as indicated by the stream output declaration), but no buffer is attached, then that output buffer slot is treated as if a full buffer is attached, resulting in the overflow behavior described here(14.5).
If an output buffer slot does not have data being streamed out to it, and a buffer is attached, then that buffer is fully ignored, including having no impact on overflow and output tracking.
The path through Rasterizer output is always available, even if Stream Output is active. When the Stream Output declaration is provided (at creation), the application indicates which output Stream, if any, is enabled for Rasterization. This is covered in the DDI here(14.3).
Chapter Contents
(back to top)
15.1 Rasterizer State
15.2 Disabling Rasterization
15.3 Always Active: Clipping, Perspective Divide, Viewport Scale
15.4 Clipping
15.5 Perspective divide
15.6 Viewport
15.7 Scissor Test
15.8 Viewport and Scissor Controls
15.9 Viewport/Scissor State
15.10 Depth Bias
15.11 Cull State
15.12 IsFrontFace
15.13 Fill Modes
15.14 State Interaction With Point/Line/Triangle Rasterization Behavior
15.15 Per-Primitive RenderTarget Array Slice Selection
15.16 Rasterizer Precision
15.17 Conservative Rasterization
15.18 Axis-Aligned Quad Rasterization
Summary of Changes in this Chapter from D3D10 to D3D11.3
Back to all D3D10 to D3D11.3 changes.(25.2)
A Rasterizer overview is here(2.8). Many fundamental basics of Rasterizer operation are also provided in the Basics(3) section.
Vertices (x,y,z,w) coming to the Rasterizer are assumed to be in homogeneous clip space. In this coordinate space the X axis points right, Y points up and Z points away from the camera.
The meanings of the states are either self-explanatory or described further below.
typedef struct D3D11_RASTERIZER_DESC1
{
    D3D11_FILL_MODE FillMode;       // described below
    D3D11_CULL_MODE CullMode;       // described below
    BOOL FrontCounterClockwise;     // do CCW primitives count as front-facing for culling?
    INT DepthBias;                  // described below
    float SlopeScaledDepthBias;     // described below
    float DepthBiasClamp;           // described below
    BOOL DepthClipEnable;           // described below
    BOOL ScissorEnable;             // described below
    BOOL MultisampleEnable;         // see Line State(15.14.1) (the name Multisample is
                                    // misleading; it affects lines only)
    BOOL AntialiasedLineEnable;     // see Line State(15.14.1)
    UINT ForcedSampleCount;         // see Target Independent Rasterization(3.5.6)
} D3D11_RASTERIZER_DESC1;
Rasterizer state is encapsulated in an object which, once created, cannot be edited. Up to 4096 such objects can be created on a given device.
The reason for the limit on the number of immutable Rasterizer State objects that can be created is to enable hardware to maintain references to multiple of these in flight in the Pipeline without having to track changes or flush the Pipeline, which would be necessary if rasterizer state could be edited.
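For illustration, a minimal sketch of creating and binding one of these immutable objects through the released D3D11.1 API (pDevice1 and pContext are assumed to be an existing ID3D11Device1 and ID3D11DeviceContext):

D3D11_RASTERIZER_DESC1 rd = {};        // all bias/enable fields default to 0/FALSE
rd.FillMode = D3D11_FILL_SOLID;
rd.CullMode = D3D11_CULL_BACK;
rd.FrontCounterClockwise = FALSE;
rd.DepthClipEnable = TRUE;
rd.ForcedSampleCount = 0;              // 0 = Target Independent Rasterization off

ID3D11RasterizerState1* pRS = nullptr;
HRESULT hr = pDevice1->CreateRasterizerState1(&rd, &pRS); // immutable once created
pContext->RSSetState(pRS);             // binding a different object is cheap;
                                       // editing an existing one is impossible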
Rasterization is disabled when the following are all true:
No Pixel Shader is bound (the Pixel Shader stage is NULL).
Depth testing is disabled at the Output Merger.
Stencil testing is disabled at the Output Merger.
There is NO facility in D3D11 for disabling clipping of X and Y coordinates, the viewport scale, or the perspective divide if the rasterizer is enabled. Clipping of the Z coordinates can be disabled by setting the DepthClipEnable Rasterizer State(15.1) to FALSE.
Note that this means there is no way for an application to directly pass RenderTarget-space coordinates for vertices. Vertex positions are always assumed to be in normalized space, so the Viewport transformation must always be relied upon to map to specific pixel locations.
In clip space primitives are clipped to the following volume:
0 < w
-w <= x <= w (or arbitrarily wider range if implementation uses a guard band to reduce clipping burden)
-w <= y <= w (or arbitrarily wider range if implementation uses a guard band to reduce clipping burden)
0 <= z <= w
By default primitives are clipped to a volume that includes a 0 <= z <= w depth range clip. Clipping of the Z coordinates can be disabled by setting the DepthClipEnable Rasterizer State(15.1) to FALSE. Primitives that fall outside of the depth range are thus still rendered, but are given the value of the nearest limit of the viewport depth range. Even when Z clipping is disabled, primitives must be clipped such that only w > 0 vertices result. Coordinates coming in to clipping with infinities at x,y,z may or may not result in a discarded primitive. Coordinates with NaN at x,y,z or w coming out of clipping are discarded.
The reason to allow disabling depth clip is that it causes problems for applications such as stencil shadows, necessitating complex code to draw end-caps on geometry that exceeds the depth range. When Z clipping is disabled, primitives may not be correctly depth-sorted at the pixel level, but this is unimportant for some applications (and can be dealt with via painter's algorithm).
There are no restrictions on the range of input vertex coordinates to clipping. Clipping operations are performed using at least float32 precision, and accordingly NaNs and infinities are processed using the floating point rules.
Two additional mechanisms for slicing geometry against application defined planes are provided, similar to each other in programming method but different in behavior:
(a) A method for clipping primitives against a plane at the rasterization level (i.e. allowing for intersection within an individual primitive)
(b) A method for culling primitives if all vertices are on the "out" side of a plane.
These mechanisms, dubbed "Clip Distances" and "Cull Distances" respectively, are described below.
To enable primitive setup / rasterizer to perform clipping against arbitrary planes defined by the application, vertex component(s) can be identified as the System Interpreted Value(4.4.5) "clipDistance". When component(s) of vertex Element(s) are identified this way, these values are each assumed to be a float32 signed distance to a plane. Primitive setup only invokes rasterization on pixels for which the interpolated plane distance(s) are >= 0.
Multiple clip planes can be implemented simultaneously, by declaring multiple component(s) of one or more vertex elements as the System Interpreted Value "clipDistance".
When multisampling, implementations MUST clip against clip distances at subsample resolution.
If a vertex has a clip distance of NaN, the primitives containing that vertex are discarded.
For further information about "clipDistance", see its listing(24.1) in the System Interpreted Values reference.
To enable rough primitive-level culling against arbitrary planes defined by the application, vertex component(s) can be identified as the System Interpreted Value(4.4.5) "cullDistance". When component(s) of vertex Element(s) are given this label, these values are each assumed to be a float32 signed distance to a plane. Primitives will be completely discarded if the plane distance(s) for all of the vertices in the primitive are < 0. Said another way, if any of the plane distance(s) (data labeled as the System Interpreted Value "cullDistance") in a primitive is >= 0, the primitive is not culled (though other culling such as backface culling could still occur and is orthogonal).
Multiple cull planes can be used simultaneously, by declaring multiple component(s) of one or more vertex elements as the System Interpreted Value "cullDistance".
Since cullDistance culling can be done simply by looking at vertices, without having to enable a path in the Rasterizer for clipping within primitives, it can be more efficient (though coarser) than using clipDistances, which must operate at the rasterization level.
If a vertex has a cull distance of NaN, that vertex counts as "out" (as if it is < 0).
For further information about "cullDistance", see its listing(24.2) in the System Interpreted Values reference.
At most 8 components in at most 2 vertex elements may be defined as System Interpreted Values "clipDistance" or "cullDistance".
For a given primitive with one or multiple components labeled as System Interpreted Value "cullDistance", the rejection test (primitive rejected if all distances < 0) is applied using all vertices for each cullDistance component, and if the primitive is rejected by any one or more of the tests it is discarded.
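As an illustration of this rejection rule, a hedged C sketch (the array layout is hypothetical; hardware evaluates this as part of primitive setup):

#include <stdbool.h>

// A primitive is discarded if, for any one cull-distance component, ALL of
// its vertices have distance < 0. NaN compares false against >= 0, so a NaN
// distance counts as "out", matching the rule above.
bool PrimitiveCulled(const float cullDist[][8], int numVerts, int numCullComponents)
{
    for (int c = 0; c < numCullComponents; ++c)   // each cull plane tested independently
    {
        bool allOut = true;
        for (int v = 0; v < numVerts; ++v)
            if (cullDist[v][c] >= 0.0f)           // any "in" vertex saves the primitive
                allOut = false;
        if (allOut)
            return true;                          // rejected by this plane
    }
    return false;                                 // survives all cull tests
}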
After cullDistance processing is complete, for remaining primitives going into rasterization setup, if there are one or multiple components labeled as System Interpreted Value "clipDistance", any region(s) of a primitive that result in one or more of the clipDistances being < 0 after interpolation are not rasterized.
Inside the Pixel Shader it is valid to declare input Element(s) labeled as System Interpreted Values "clipDistance" and "cullDistance", in which case the appropriately interpolated clip distances or cull distances show up, as expected.
The interpolation mode declared(22.3.10) by the Pixel Shader on any input v# register labeled as System Interpreted Value "clipDistance" must be D3DINTERPOLATION_LINEAR. No such limitation exists for input v# registers labeled as System Interpreted Value "cullDistance"; these can be interpolated any way into the Pixel Shader.
Note that clip/cull distances have no effect on GS stream output if it is active. The clip/cull can be thought of as appearing after the stream output in the Pipeline.
After clipping, the position X,Y,Z coordinates and non-constant vertex attributes with interpolation mode linear (meaning with perspective) are divided by the position W value.
Viewports map clip-space vertex positions into RenderTarget space. In RenderTarget space the Y axis points down, so Y coordinates are flipped during the viewport scale. Multiple Viewports can be made available simultaneously, so that each primitive can select one (see Viewport Index(15.8.1)); however, the basic case is simply to use a single Viewport for all rendering in a particular scene. Only one Viewport can ever apply to an individual primitive being rasterized.
Viewport extents are specified as float32 values (see the D3D11_VIEWPORT structure in the Viewport/Scissor State(15.9) section), and the x/y extents may be fractional. Operations using the extents are performed with float32 arithmetic.
There is always an implicit scissoring by the Viewport x/y extents, orthogonal to other Scissor(15.7) state. In other words, regardless of whether an implementation has a guard band in its clipper, rendering will never touch any area outside the Viewport's x/y extents (except a small nondeterministic region that appears if the viewport left and top extents have fractional coordinates, discussed in the Viewport Range(15.6.1) section).
If a Viewport has not been set, then the default is a Viewport with all extents 0: {0,0,0,0,0.0f,0.0f}. When RenderTargets change, there is no automatic update of the Viewport.
Viewport scale is performed using float32 arithmetic according to the following formulas:
Xrt= (X + 1) * Viewport.Width * 0.5 + Viewport.TopLeftX
Yrt= (1 - Y) * Viewport.Height * 0.5 + Viewport.TopLeftY
Zrt= Viewport.MinDepth + Z * (Viewport.MaxDepth - Viewport.MinDepth)
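Expressed as code, the scale is a direct transliteration of the formulas above; a sketch, with a struct mirroring D3D11_VIEWPORT:

typedef struct { float TopLeftX, TopLeftY, Width, Height, MinDepth, MaxDepth; } Viewport;

// Inputs are clip-space position components after the divide by w.
void ViewportTransform(const Viewport* vp, float x, float y, float z,
                       float* pXrt, float* pYrt, float* pZrt)
{
    *pXrt = (x + 1.0f) * vp->Width  * 0.5f + vp->TopLeftX;    // [-1..1] -> [left..right]
    *pYrt = (1.0f - y) * vp->Height * 0.5f + vp->TopLeftY;    // Y flips: +Y up becomes +Y down
    *pZrt = vp->MinDepth + z * (vp->MaxDepth - vp->MinDepth); // depth range remap
}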
An additional effect of the Viewport is that in the Output Merger, just before the final rounding of z to depth-buffer format before depth compare, the z value is always clamped: z = min(Viewport.MaxDepth,max(Viewport.MinDepth,z)), in compliance with D3D11 Floating Point Rules(3.1) for min and max. This clamping occurs regardless of where z came from: out of interpolation, or from z output by the Pixel Shader (replacing the interpolated value). Z input to the Pixel Shader is not clamped (since the clamp described here occurs after the Pixel Shader).
Viewport MinDepth and MaxDepth must both be in the range [0.0f...1.0f], and MinDepth must be less-than or equal-to MaxDepth.
The Rasterizer must support(15.16) fixed-point x,y positions after Viewport scale with 16.8 precision (approximately [-32768…32767] range). As such D3D11 defines the following constraints on the float Viewport Width, Height, TopLeftX and TopLeftY parameters:
-32768 <= Viewport.TopLeftX <= 32767
-32768 <= Viewport.Width + Viewport.TopLeftX <= 32767
-32768 <= Viewport.TopLeftY <= 32767
-32768 <= Viewport.Height + Viewport.TopLeftY <= 32767
The runtime validates Viewport parameters to be within these ranges and skips the call if there is an error, so the DDI will never see invalid parameters.
The behavior of the implicit scissor to the viewport with fractional viewport extents is described in the Scissor(15.7) section (basically rounding X and Y to negative infinity to get integers).
Observe that when the viewport location is fractional, which results in rounding to determine the implicit scissor, there is effectively a non-deterministic zone, up to 1/2 pixel wide, along the left and top edges within the scissor area that is not covered by the viewport. Because guard-band clipping to viewport extents is optional for implementations, and implementations that do perform it can vary, rendering results in the non-deterministic zone are some undefined combination of background values and primitives that may or may not have been clipped at the zone.
If an application needs to avoid artifacts from this non-deterministic zone, one approach is to simply never use fractional viewport extents. Another approach, if fractional viewports are needed, is to always subtract 1 from the intended viewport TopLeftX and TopLeftY, while adding 1 to the intended Viewport Width and Height, then defining the Scissor extents over the intended pixel area. This crops out the non-deterministic zone and allows fractional viewports that, for example, smoothly move the inside contents (even though the extents are rounded), without any non-deterministic rendering. A sketch of this appears below.
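A hedged sketch of this workaround, assuming the released D3D11 headers (where D3D11_RECT is a left/top/right/bottom RECT, slightly different from the range-based struct shown in the Viewport/Scissor State section below); the intended* values are hypothetical:

float intendedX = 10.5f, intendedY = 20.25f;   // fractional top-left
float intendedW = 640.0f, intendedH = 480.0f;

D3D11_VIEWPORT vp = {};
vp.TopLeftX = intendedX - 1.0f;  // pull left edge out by 1 pixel
vp.TopLeftY = intendedY - 1.0f;  // pull top edge out by 1 pixel
vp.Width    = intendedW + 1.0f;  // right edge stays where intended
vp.Height   = intendedH + 1.0f;  // bottom edge stays where intended
vp.MinDepth = 0.0f;
vp.MaxDepth = 1.0f;

D3D11_RECT sc = {};
sc.left   = (LONG)intendedX;     // crop back to the intended pixel area
sc.top    = (LONG)intendedY;
sc.right  = (LONG)(intendedX + intendedW);
sc.bottom = (LONG)(intendedY + intendedH);

pContext->RSSetViewports(1, &vp);
pContext->RSSetScissorRects(1, &sc); // ScissorEnable must be TRUE in the
                                     // bound Rasterizer State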
Scissor cuts out a rectangle in RenderTarget space where pixels are permitted to appear. Any pixel outside these extents is discarded. Multiple Scissor rectangles can be active simultaneously, from which individual primitives can choose one (see Selecting Viewport/Scissor(15.8.1) below). Only one scissor rectangle can ever apply to an individual primitive being rasterized, though this does not count the implied scissoring that is always applied to the Viewport(15.6)'s x/y extents.
Scissor extents are specified as unsigned integers, with no limits on the magnitudes of the extents. If the Scissor rectangle falls entirely off the currently set RenderTarget(s), then simply nothing will get drawn. If the Scissor rectangle is larger than the currently set RenderTarget(s) or straddles an edge, then the only pixels that can be drawn are the ones in the covered area of the RenderTarget(s). The Scissor can be enabled or disabled (all Scissors together) using the Rasterizer State(15.1) ScissorEnable. If disabled, any pixel on the RenderTarget(s) can be drawn to. The default Scissor rectangle is an empty Scissor rectangle: {0,0,0,0}.
The implicit scissor to the viewport (mentioned in the Viewport(15.6) section) rounds the viewport X and Y extents to negative infinity. This way the scissor extents are always integers. The rounding to derive scissor extents applies to the locations where the fractional left/right/top/bottom edges would be after the float viewport transform. E.g. the viewport width and height cannot be rounded; they must be added to unrounded TopLeftX and TopLeftY to determine the right and bottom extents, which then get rounded to determine the scissor extents.
There is a set of 16 Viewports and Scissor rects that can be set active via the API/DDI. By default, the 0-th Viewport and Scissor settings are used during rasterization setup. But Viewports can be selected on a per-primitive basis from the Geometry Shader by naming a component of GS output vertex data "ViewportArrayIndex"(24.5). "ViewportArrayIndex", taken from the Leading Vertex(8.14) for a primitive, is interpreted as a 32-bit unsigned integer value, with meaningful values in the range [0..n-1] (where n is the maximum number of viewports allowed). Values outside [0..n-1] are treated as 0 for indexing viewports. If the Pixel Shader inputs "ViewportArrayIndex", whatever value "ViewportArrayIndex" was given shows up unmodified/unclamped in the Shader (even if outside the [0..n-1] range).
If the Geometry Shader is not used, the default 0-th Viewport and Scissor settings are used.
typedef struct D3D11_VIEWPORT
{
    float TopLeftX;
    float TopLeftY;    /* Viewport top left */
    float Width;
    float Height;      /* Viewport dimensions */
    float MinDepth;    /* Min/max of clip volume */
    float MaxDepth;
} D3D11_VIEWPORT;

typedef struct D3D11_RANGE
{
    SIZE_T Start;
    SIZE_T End;        /* One past end; Size = ( End - Start ) */
} D3D11_RANGE;

typedef struct D3D11_RECT
{
    D3D11_RANGE X;
    D3D11_RANGE Y;
} D3D11_RECT;

typedef struct D3D11_BOX
{
    D3D11_RANGE X;
    D3D11_RANGE Y;
    D3D11_RANGE Z;
} D3D11_BOX;

SetViewports(UINT NumViewports, const D3D11_VIEWPORT *pViewports);  /* NumViewports: 0 - 15 */
SetScissorRects(UINT NumRects, const D3D11_RECT *pRects);           /* NumRects: 0 - 15 */
Rasterizer State(15.1) defining Depth Biasing:

    INT   DepthBias
    float SlopeScaledDepthBias
    float DepthBiasClamp

Formulas:

    MaxDepthSlope = max(abs(dz/dx), abs(dz/dy)) // approximation of max depth
                                                // slope for polygon
    if( SlopeScaledDepthBias != 0 )
        SlopeScaledDepthBias = SlopeScaledDepthBias * MaxDepthSlope;
    // Above: only doing SlopeScaledDepthBias math when nonzero to avoid
    // a 0*INF = NaN scenario with edge-on wireframe triangles.
    // Previously in the D3D10 spec, hardware was erroneously spec'd to
    // unconditionally multiply SlopeScaledDepthBias with MaxDepthSlope.
    // The new behavior defined here applies to any new hardware regardless
    // of what D3D API or feature level it is running against.

When a UNORM Depth Buffer is bound at the Output Merger (or no Depth Buffer is bound):

    Bias = (float)DepthBias * r + SlopeScaledDepthBias

where r is the minimum representable value > 0 in the depth buffer format, converted to float32.

When a Floating Point Depth Buffer is bound at the Output Merger:

    Bias = (float)DepthBias * 2^(exponent(max abs(z) in primitive) - r) + SlopeScaledDepthBias

where r is the number of mantissa bits in the floating point representation (excluding the hidden bit), e.g. 23 for float32.

Adding Bias to z:

    if( DepthBiasClamp > 0 )
        Bias = min(DepthBiasClamp, Bias)
    else if( DepthBiasClamp < 0 )
        Bias = max(DepthBiasClamp, Bias)
    // else DepthBiasClamp == 0: no clamping occurs

    if( (DepthBias != 0) || (SlopeScaledDepthBias != 0) )
        z = z + Bias
Biasing is constant for a given primitive, with the same value added to the z for each vertex before interpolator setup.
The biasing formulas are performed with float32 arithmetic.
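As a concrete illustration, a hedged C sketch of the UNORM-case computation for a 24-bit UNORM depth buffer, where r = 1/(2^24 - 1) is the minimum representable value greater than zero in that format:

#include <math.h>

float ComputeD24UnormBias(int depthBias, float slopeScaledDepthBias,
                          float depthBiasClamp, float dzdx, float dzdy)
{
    const float r = 1.0f / 16777215.0f;        // 1/(2^24 - 1)
    float maxDepthSlope = fmaxf(fabsf(dzdx), fabsf(dzdy));
    float slopeTerm = 0.0f;
    if (slopeScaledDepthBias != 0.0f)          // avoid 0 * INF = NaN for edge-on triangles
        slopeTerm = slopeScaledDepthBias * maxDepthSlope;
    float bias = (float)depthBias * r + slopeTerm;
    if (depthBiasClamp > 0.0f)      bias = fminf(depthBiasClamp, bias);
    else if (depthBiasClamp < 0.0f) bias = fmaxf(depthBiasClamp, bias);
    return bias;                               // added to z when biasing is enabled
}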
Depth Bias is not applied to any point or line primitives, except for lines drawn in wireframe mode as described in the Fill Modes(15.13) section.
Depth Bias is disabled by setting both DepthBias and SlopeScaledDepthBias to zero, in which case the depth value is unmodified. Note that skipping the addition in this case prevents propagation of IEEE specials that could be generated if the operation were performed even with zero DepthBias and SlopeScaledDepthBias values.
Comments on one of the usage scenarios for Depth Biasing:
One of the artifacts with shadow-buffer-based shadows is "shadow acne": a surface shadowing itself in a spotty way because of inexactness in computing the depth of a surface in the shader, to be compared against the depth of the same surface in the shadow buffer. A way to alleviate this is to use DepthBias and SlopeScaledDepthBias when rendering a shadow buffer. The intent is to push surfaces out enough when rendering the shadow buffer so that when compared against themselves via shader-computed z during the shadow test, the comparison result is consistent across the surface, and local self-shadowing is avoided.
However, using DepthBias and SlopeScaledDepthBias alone introduces a few artifacts of its own: an extremely steep polygon causes the bias equation to explode, pushing the polygon extremely far away from the originating surface in the shadow map. Consider a steep face, with respect to the light, that gets pushed away extremely far in relation to the dimensions of the parent object by Depth Biasing. Suppose this face is surrounded by shallower faces which the bias equation pushed out much, much less. The resulting shadow map has a huge discontinuity