Direct3D 11.3 Functional Specification

Version 1.16 - 4/23/2015


Full Table of Contents at end of document.

Condensed Table of Contents

  • 1 Introduction
  • 2 Rendering Pipeline Overview
  • 3 Basics
  • 4 Rendering Pipeline
  • 5 Resources
  • 6 Multicore
  • 7 Common Shader Internals
  • 8 Input Assembler Stage
  • 9 Vertex Shader Stage
  • 10 Hull Shader Stage
  • 11 Tessellator
  • 12 Domain Shader Stage
  • 13 Geometry Shader Stage
  • 14 Stream Output Stage
  • 15 Rasterizer Stage
  • 16 Pixel Shader Stage
  • 17 Output Merger Stage
  • 18 Compute Shader Stage
  • 19 Stage-Memory I/O
  • 20 Asynchronous Notification
  • 21 System Limits on Various Resources
  • 22 Shader Instruction Reference
  • 23 System Generated Values Reference
  • 24 System Interpreted Values Reference
  • 25 Appendix
  • 26 Constant Listing (Auto-generated)



  • 1 Introduction


    Chapter Contents

    (back to top)

    1.1 Purpose
    1.2 Audience
    1.3 Topics Covered
    1.4 Topics Not Covered
    1.5 Not Optimized for Smooth Reading
    1.6 How D3D11.3 Fits into this Unified Spec


    1.1 Purpose

    This document describes hardware requirements for Direct3D 11.3 (D3D11.3).

    1.2 Audience

    It is assumed that the reader is familiar with real-time graphics, modern Graphics Processing Unit (GPU) design issues and the general architecture of Microsoft Windows Operating Systems, as well as their planned release roadmap.

    The target audience for this spec is the implementers, testers and documenters of hardware or software components that would be considered part of a D3D11.3-compliant system. In addition, software developers who are invested in the details of medium-term GPU hardware direction will find interesting information here.

    1.3 Topics Covered

    Topics covered in this spec center on the definition of the hardware architecture being targeted by the D3D11.3 Graphics Pipeline, in a form that attempts to be agnostic to any single vendor's hardware implementation. Included are some references to how the Graphics Pipeline is controlled through a Device Driver Interface (DDI), and occasionally depictions of API usage as needed to illustrate points.

    Occasionally, boxed text such as this appears in the spec to indicate justification for decisions, explain history about a feature, provide clarifications or general remarks about a topic being described, or to flag unresolved issues. These shaded boxes DO NOT provide a complete listing of all such trivia, however. Note that on each revision of this spec, all changes made for that revision are summarized in a separate document typically distributed with the spec.

    1.4 Topics Not Covered

    The exact relationship and interactions between topics covered in the Graphics Pipeline with other Operating System components is not covered.

    GPU resource management, GPU process scheduling, and low-level Operating System driver/kernel architecture are not covered.

    High-level GPU programming concepts (such as high level shading languages) are not covered.

    Little to no theory or derivation of graphics concepts, techniques or history is provided. Equally rare for this spec is any attempt to characterize what sorts of things applications software developers might do using the functionality provided by D3D11.3. There are exceptions, but do not expect to gain much more than an understanding of the "facts" about D3D11.3 from this spec.

    1.5 Not Optimized for Smooth Reading

    Beware, there is little flow to the content in this spec, although there are plenty of links from place to place.


    1.6 How D3D11.3 Fits into this Unified Spec

    This document is the product of starting with the full D3D11.2 functional spec and adding in relevant WindowsNext D3D11.3 features.

    Each Chapter in this spec begins with a summary of the changes from D3D10 to D3D10.1 to D3D11 to D3D11.1 to D3D11.2 to D3D11.3 for that Chapter. A table of links to all of the Chapter delta summaries can be found here(25.2).

    To find D3D11.3 changes specifically (which include changes for optional new features as well as clarifications/corrections that affect all feature levels), look for "[D3D11.3]" in the chapter changelists (or simply search the doc for it).



    2 Rendering Pipeline Overview


    Chapter Contents

    (back to top)

    2.1 Input Assembler (IA) Overview
    2.2 Vertex Shader (VS) Overview
    2.3 Hull Shader (HS) Overview
    2.4 Tessellator (TS) Overview
    2.5 Domain Shader (DS) Overview
    2.6 Geometry Shader (GS) Overview
    2.7 Stream Output (SO) Overview
    2.8 Rasterizer Overview
    2.9 Pixel Shader (PS) Overview
    2.10 Output Merger (OM) Overview
    2.11 Compute Shader (CS) Overview


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)


    D3D11.3 hardware, like previous generations, can be designed with shared programmable cores. A farm of Shader cores exists on the GPU, able to be scheduled across the functional blocks comprising the D3D11.3 Pipeline, depicted below.


    2.1 Input Assembler (IA) Overview

    The Input Assembler (IA) introduces triangles, lines, points or Control Points (for Patches) into the graphics Pipeline, by pulling source geometry data out of 1D Buffers(5.3.4).

    Vertex data can come from multiple Buffers, accessed in an "Array-of-Structures" fashion from each Buffer. The Buffers are each bound to an individual input slot and given a structure stride. The layout of data across all the Buffers is specified by an Input Declaration, in which each entry defines an "Element" with: an input slot, a structure offset, a data type, and a target register (for the first active Shader in the Pipeline).
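
    For illustration, at the API level such a declaration is expressed as an array of D3D11_INPUT_ELEMENT_DESC entries; the target register is implied by the semantic name, which is matched against the first active Shader's input signature. The specific values below are merely an example:

        D3D11_INPUT_ELEMENT_DESC layout[] =
        {
            // semantic   index  format (data type)            slot  byteOffset
            { "POSITION", 0,     DXGI_FORMAT_R32G32B32_FLOAT,  0,    0,
              D3D11_INPUT_PER_VERTEX_DATA, 0 },
            { "COLOR",    0,     DXGI_FORMAT_R8G8B8A8_UNORM,   1,    0,
              D3D11_INPUT_PER_VERTEX_DATA, 0 },
        };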

    A given sequence of vertices is constructed out of data fetched from Buffers, in a traversal directed by a combination of fixed-function state and various Draw*() API/DDI calls. Various primitive topologies are available to make the sequence of vertex data represent a sequence of primitives. Example topologies are: point-list, line-list, triangle-list, triangle-strip, 8 control-point patch-list.

    Vertex data can be produced in one of two ways. The first is "Non-Indexed" rendering, which is the sequential traversal of Buffer(s) containing vertex data, originating at a start offset at each Buffer binding. The second method for producing vertex data is "Indexed" rendering, which is sequential traversal of a single Buffer containing scalar integer indices, originating at a start offset into the Buffer. Each index indicates where to fetch data out of Buffer(s) containing vertex data. The index values are independent of the characteristics of the Buffers they are referring to; Buffers are described by a declaration as mentioned earlier. So the task accomplished by "Non-Indexed" and "Indexed" rendering, each in their own way, is producing addresses from which to fetch vertex data in memory, and subsequently assemble the results into vertices and primitives.
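
    As a non-normative sketch (written in C, loosely following the API parameter names), the two methods generate vertex data addresses as follows:

        // Non-Indexed: byte address of vertex v within a bound Vertex Buffer.
        unsigned NonIndexedAddress(unsigned bindOffset, unsigned stride,
                                   unsigned startVertex, unsigned v)
        {
            return bindOffset + (startVertex + v) * stride;
        }

        // Indexed: fetch a scalar index first, then use it to address the
        // vertex data. baseVertex biases every fetched index.
        unsigned IndexedAddress(unsigned bindOffset, unsigned stride,
                                const unsigned *indexBuffer,
                                unsigned startIndex, int baseVertex, unsigned v)
        {
            int index = (int)indexBuffer[startIndex + v] + baseVertex;
            return bindOffset + (unsigned)index * stride;
        }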

    Instanced geometry rendering is enabled by allowing the sequential traversal, in either Non-Indexed or Indexed rendering, to loop over a range within each Vertex Buffer (Non-Indexed case) or Index Buffer (Indexed case). Buffer bindings can be identified as "Instance Data" or "Vertex Data", indicating how to use the bound Buffer while performing instanced rendering. The address generated by "Non-Indexed" or "Indexed" rendering is used to fetch "Vertex Data", accounting also for looping when doing Instanced rendering. "Instance Data", on the other hand, is always sequentially traversed starting from a per-Buffer offset, at a frequency equal to one step per instance (e.g. one step forward after the number of vertices in an instance are traversed). The step rate for "Instance Data" can also be chosen to be a subharmonic of the instance frequency (i.e. one step forward every other instance, every third instance etc.), as sketched below.
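
    For example, which element of an "Instance Data" Buffer is read for a given instance reduces to the following (a sketch; the names are illustrative, not actual API/DDI parameters):

        // stepRate: 1 = step every instance, 2 = every other instance, etc.
        unsigned InstanceDataElement(unsigned instance, unsigned stepRate)
        {
            return instance / stepRate;
        }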

    Another use of the Input Assembler is that it can read Buffers that were written to from the Stream Output(2.7) stage. Such a scenario necessitates a particular type of Draw, DrawAuto(8.9). DrawAuto enables the Input Assembler to know how much data was dynamically written to a Stream Output Buffer without CPU involvement.

    In addition to producing vertex data from Buffers, the IA can auto-generate scalar counter values such as: VertexID(8.16), PrimitiveID(8.17) and InstanceID(8.18), for input to shader stages in the graphics pipeline.

    In "Indexed" rendering of strip topologies, such as triangle strips, a mechanism is provided for drawing multiple strips with a single Draw*() call (i.e. 'cut'ting strips).

    Specific operational details of the IA are provided here(8).


    2.2 Vertex Shader (VS) Overview

    The Vertex Shader stage processes vertices, performing operations such as transformations, skinning, and lighting. Vertex Shaders always operate on a single input vertex and produce a single output vertex. This stage must always be active.

    Specific operational details of Vertex Shaders are provided here(9).


    2.3 Hull Shader (HS) Overview

    The Hull Shader operates once per Patch (it can only be used with Patches from the IA). It can transform input Control Points that make up a Patch into Output Control Points, and it can perform other setup for the fixed-function Tessellator stage (outputting TessFactors, which are numbers that indicate how much to tessellate).

    Specific operational details of the Hull Shader are provided here(10).


    2.4 Tessellator (TS) Overview

    The Tessellator is a fixed function unit whose operation is defined by declarations in the Hull Shader. It operates once per Patch output by the Hull Shader. The Hull shader outputs TessFactors which are numbers that tell the Tessellator how much to tessellate (generate geometry and connectivity) over the domain of the Patch.

    Specific operational details of the Tessellator are provided here(11).


    2.5 Domain Shader (DS) Overview

    The Domain Shader is invoked once per vertex generated by the Tessellator. Each invocation is identified by its coordinate on a generic domain, and the role of the Domain Shader is to turn that coordinate into something tangible (such as a point in 3D space) for use downstream. Each Domain Shader invocation for a Patch also sees shared input of all the Hull Shader output (such as output Control Points).

    Specific operational details of the Domain Shader are provided here(12).


    2.6 Geometry Shader (GS) Overview

    The Geometry Shader runs application-specified Shader code with vertices as input and the ability to generate vertices on output. The Geometry Shader's inputs are the vertices for a full primitive (two vertices for a line, three vertices for a triangle, a single vertex for a point, or all Control Points for a Patch if it reaches the GS with Tessellation disabled). Some types of primitives can also include the vertices of edge-adjacent primitives (an additional two vertices for a line, an additional three for a triangle).

    Another input is a PrimitiveID auto-generated by the IA. This allows per-face data to be fetched or computed if desired.

    The Geometry Shader stage is capable of outputting multiple vertices forming a single selected topology (GS output topologies available are: tristrip, linestrip, pointlist). The number of primitives emitted can vary freely within any invocation of the Geometry Shader, though the maximum number of vertices that could be emitted must be declared statically. Strip lengths emitted from a GS invocation can be arbitrary (there is a 'cut'(22.8.1) command).

    Output may be fed to rasterizer and/or out to vertex Buffers in memory. Output fed to memory is expanded to individual point/line/triangle lists (the same way they would get passed to the rasterizer).

    A variety of algorithms can be implemented in the Geometry Shader.

    Specific operational details of the Geometry Shader are provided here(13).


    2.7 Stream Output (SO) Overview

    Vertices may be streamed out to memory just before arriving at the Rasterizer. This is like a "tap" in the Pipeline, which can be turned on even as data continues to flow down to the Rasterizer. Data sent out via Stream Output is concatenated to Buffer(s). These Buffers may on subsequent passes be recirculated as Pipeline inputs.

    One constraint about Stream Output is that it is tied to the Geometry Shader, in that both must be created together (though either can be "NULL"/"off"). The particular memory Buffer(s) being Streamed out are not tied to this GS/SO pair though. Only the description of which parts of vertex data to feed to Stream Output are tied to the GS.

    One use for Stream Output is for saving ordered Pipeline data that will be reused. For example a batch of vertices might be "skinned" by passing the vertices into the Pipeline as if they are independent points (just to visit all of them once), applying "skinning" operations on each vertex, and streaming out the results to memory. The saved out "skinned" vertices are now available for use in subsequent passes as input.

    Since the amount of output written through Stream Output can be unpredictably dynamic, a special type of Draw command, DrawAuto(8.9), is necessary. DrawAuto enables the Input Assembler to know how much data was dynamically written to a Stream Output Buffer without CPU involvement. In addition, Queries are necessary to mitigate Stream Output overflow(20.4.10), as well as retrieve how much data was written(20.4.9) to the Stream Output Buffers.

    Specific operational details of the Stream Output are provided here(14).


    2.8 Rasterizer Overview

    The rasterizer is responsible for clipping, primitive setup, and determining how to invoke Pixel Shaders. D3D11.3 does not view this as a "stage" in the Pipeline, but rather an interface between Pipeline stages which happens to perform a significant set of fixed function operations, many of which can be adjusted by software developers.

    The rasterizer always assumes input positions are provided in clip-space, performs clipping, perspective divide and applies viewport scale/offset.

    Specific operational details of the Rasterizer are provided here(15).


    2.9 Pixel Shader (PS) Overview

    Input data available to the Pixel Shader includes vertex attributes that can be chosen, on a per-Element basis, to be interpolated with or without perspective correction, or be treated as constant per-primitive.

    The Pixel Shader can also be chosen to be invoked either once per pixel or once per covered sample within the pixel.

    Outputs are one or more 4-vectors of output data for the current pixel or sample, or no color (if pixel is discarded).

    The Pixel Shader has some other inputs and outputs available as well, similar to the kind of inputs and outputs the Compute Shader can use, allowing, for instance, the ability to write to scattered locations.

    Specific operational details of Pixel Shaders are provided here(16).


    2.10 Output Merger (OM) Overview

    The final step in the logical Pipeline is visibility determination, through stencil or depth, and writing or blending of output(s) to RenderTarget(s), which may be one of many Resource Types(5).

    These operations, as well as the binding of output resources (RenderTargets), are defined at the Output Merger.

    Specific operational details of the Output Merger are provided here(17).


    2.11 Compute Shader (CS) Overview

    The Compute Shader allows the GPU to be viewed as a generic grid of data-parallel processors, without any graphics baggage from the graphics pipeline. The Compute Shader has explicit access to fast shared memory to facilitate communication between groups of shader invocations, and the ability to perform scattered reads and writes to memory. The availability of atomic operations enables unique access to shared memory addresses. The Compute Shader is not part of the Graphics Pipeline (all the previously discussed shader stages). The Compute Shader exists on its own, albeit on the same device as all the other Shader Stages. To invoke this shader, Dispatch*() APIs are called instead of Draw*().

    Specific operational details of Compute Shaders are provided here(18).


    3 Basics


    Chapter Contents

    (back to top)

    3.1 Floating Point Rules
    3.2 Data Conversion
    3.3 Coordinate Systems
    3.4 Rasterization Rules
    3.5 Multisampling


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)


    3.1 Floating Point Rules


    Section Contents

    (back to chapter)

    3.1.1 Overview
    3.1.2 Term: Unit-Last-Place (ULP)
    3.1.3 32-bit Floating Point

    3.1.3.1 Partial Listing of Honored IEEE-754 Rules
    3.1.3.2 Complete Listing of Deviations or Additional Requirements vs. IEEE-754
    3.1.4 64-bit (Double Precision) Floating Point
    3.1.5 16-bit Floating Point
    3.1.6 11-bit and 10-bit Floating Point


    3.1.1 Overview

    D3D11 supports several different floating point representations for storage. However, all floating point computations in D3D11, whether in Shader programs written by application developers or in fixed function operations such as texture filtering or RenderTarget blending, are required to operate under a defined subset of the IEEE 754 32-bit single precision floating point behavior.

    3.1.2 Term: Unit-Last-Place (ULP)

    One ULP is the smallest representable delta from one value in a numeric representation to an adjacent value. The absolute magnitude of this delta varies with the magnitude of the number in the case of a floating point number. If, hypothetically, the result of an arithmetic operation were allowed to have a tolerance 1 ULP from the infinitely precise result, this would allow an implementation that always truncated its result (without rounding), resulting in an error of at most one unit in the last (least significant) place in the number representation. On the other hand, it would be much more desirable to require 0.5 ULP tolerance on arithmetic results, since that requires the result be the closest possible representation to the infinitely precise result, using round to nearest-even.
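
    For example, in the 32-bit float representation one ULP at 1.0f is 2^-23, which the following C fragment demonstrates:

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            // distance from 1.0f to the next representable float32 value
            float ulp = nextafterf(1.0f, 2.0f) - 1.0f;
            printf("%g\n", ulp); // prints 1.19209e-07, i.e. 2^-23
            return 0;
        }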

    3.1.3 32-bit Floating Point

    3.1.3.1 Partial Listing of Honored IEEE-754 Rules

    Here is a summary of expected 32-bit floating point behaviors for D3D11. Some of these points choose a single option in cases where IEEE-754 offers choices. This is followed by a listing of deviations or additions to IEEE-754 (some of which are significant). Refer to IEEE-754 for topics not mentioned.

    3.1.3.2 Complete Listing of Deviations or Additional Requirements vs. IEEE-754

    3.1.4 64-bit (Double Precision) Floating Point

    Double-precision floating-point support is optional, however all double-precision floating point instructions listed in this spec (here (arithmetic)(22.14), here (conditional)(22.15), here (move)(22.16) and here (type conversion)(22.17) ) must be implemented if double support is enabled.

    Double-precision floating-point usage is indicated at compile time by declaring shader model 5_a. Support for Shader Model 5.0a will be reportable by drivers and discoverable by applications via an API.

    When supported, double-precision instructions match IEEE 754R behavior requirements (with the exception of double precision reciprocal(22.14.5) which is permitted 1.0 ULP tolerance and the exact result if representable).

    An exception to the 4-vector register convention exists for double-precision floating-point instructions, which operate on pairs of doubles. Double-precision floating-point values are in IEEE 754R format. One double is stored in .xy with the least significant 32 bits in x, and the most significant 32 bits in y. Similarly the second double is stored in .zw with the least significant 32 bits in z, and the most significant 32 bits in w.

    The permissible swizzles for double operations are .xyzw, .xyxy, .zwxy, .zwzw. The permissible write masks for double operations are .xy, .zw, and .xyzw.
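
    A sketch of this packing, interpreting the 64 bits of a double as the .xy pair of 32-bit register components:

        #include <stdint.h>
        #include <string.h>

        void SplitDouble(double d, uint32_t *x, uint32_t *y)
        {
            uint64_t bits;
            memcpy(&bits, &d, sizeof bits);      // raw bit pattern of the double
            *x = (uint32_t)(bits & 0xFFFFFFFFu); // least significant 32 bits -> .x
            *y = (uint32_t)(bits >> 32);         // most significant 32 bits  -> .y
        }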

    Support for generation of denormalized values is required for double-precision data (no flush-to-zero behavior). Likewise, instructions do not read denormalized data as a signed zero - they honor the denorm value.

    3.1.5 16-bit Floating Point

    Several resource formats in D3D11 contain 16-bit representations of floating point numbers. This section describes the float16 representation.

    Format:

    A float16 value, v, made from the format above takes the following meaning:

    32-bit floating point rules also hold for 16-bit floating point numbers, adjusted for the bit layout described above.

    The exceptions are:
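
    For reference, a non-normative decode of a float16 value, assuming the standard layout of 1 sign bit, 5 exponent bits (biased by 15) and 10 mantissa bits:

        #include <math.h>
        #include <stdint.h>

        float Float16ToFloat(uint16_t v)
        {
            unsigned s = (v >> 15) & 0x1;  // sign
            unsigned e = (v >> 10) & 0x1F; // biased exponent
            unsigned m = v & 0x3FF;        // mantissa
            float sign = s ? -1.0f : 1.0f;

            if (e == 0)  // zero or denormalized: (-1)^s * 0.m * 2^-14
                return sign * (m / 1024.0f) * exp2f(-14.0f);
            if (e == 31) // INF (m == 0) or NaN (m != 0)
                return m ? nanf("") : sign * INFINITY;
            return sign * (1.0f + m / 1024.0f) * exp2f((float)e - 15.0f);
        }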

    3.1.6 11-bit and 10-bit Floating Point

    A single resource format in D3D11 contains 11-bit and 10-bit representations of floating point numbers. This section describes the float11 and float10 representations.

    Format:

    A float11/float10 value, v, made from the format above takes the following meaning:

    32-bit floating point rules also hold for 11-bit and 10-bit floating point numbers, adjusted for the bit layout described above.

    The exceptions are:
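
    For reference, a non-normative decode of a float11 value, assuming the standard layout of 5 exponent bits (biased by 15) and 6 mantissa bits with no sign bit; float10 is identical except for a 5-bit mantissa (divide m by 32.0f instead):

        #include <math.h>
        #include <stdint.h>

        float Float11ToFloat(uint16_t v)
        {
            unsigned e = (v >> 6) & 0x1F; // biased exponent
            unsigned m = v & 0x3F;        // mantissa

            if (e == 0)  // zero or denormalized: 0.m * 2^-14
                return (m / 64.0f) * exp2f(-14.0f);
            if (e == 31) // INF (m == 0) or NaN (m != 0)
                return m ? nanf("") : INFINITY;
            return (1.0f + m / 64.0f) * exp2f((float)e - 15.0f);
        }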


    3.2 Data Conversion


    Section Contents

    (back to chapter)

    3.2.1 Overview
    3.2.2 Floating Point Conversion
    3.2.3 Integer Conversion

    3.2.3.1 Terminology
    3.2.3.2 Integer Conversion Precision
    3.2.3.3 SNORM -> FLOAT
    3.2.3.4 FLOAT -> SNORM
    3.2.3.5 UNORM -> FLOAT
    3.2.3.6 FLOAT -> UNORM
    3.2.3.7 SRGB -> FLOAT
    3.2.3.8 FLOAT -> SRGB
    3.2.3.9 SINT -> SINT (With More Bits)
    3.2.3.10 UINT -> SINT (With More Bits)
    3.2.3.11 SINT -> UINT (With More Bits)
    3.2.3.12 UINT -> UINT (With More Bits)
    3.2.3.13 SINT or UINT -> SINT or UINT (With Fewer or Equal Bits)
    3.2.4 Fixed Point Integers
    3.2.4.1 FLOAT -> Fixed Point Integer
    3.2.4.2 Fixed Point Integer -> FLOAT


    3.2.1 Overview

    This section describes the rules for various data conversions in D3D11. Other relevant information regarding data conversion is in the Data Invertability(19.1.2) section.

    3.2.2 Floating Point Conversion

    Whenever a floating point conversion between different representations occurs, including to/from non-floating point representations, the following rules apply.

    These are rules for converting from a higher range representation to a lower range representation:

    These are rules for converting from a lower precision/range representation to a higher precision/range representation:

    3.2.3 Integer Conversion

    3.2.3.1 Terminology

    The following set of terms are subsequently used to characterize various integer format conversions.

    SNORM: Signed normalized integer, meaning that for an n-bit 2's complement number, the maximum value means 1.0f (e.g. the 5-bit value 01111 maps to 1.0f), and the minimum value means -1.0f (e.g. the 5-bit value 10000 maps to -1.0f). In addition, the second-minimum number also maps to -1.0f (e.g. the 5-bit value 10001 maps to -1.0f). There are thus two integer representations for -1.0f. There is a single representation for 0.0f, and a single representation for 1.0f. This results in a set of integer representations for evenly spaced floating point values in the range (-1.0f...0.0f), and also a complementary set of representations for numbers in the range (0.0f...1.0f).

    UNORM: Unsigned normalized integer, meaning that for an n-bit number, all 0's means 0.0f, and all 1's means 1.0f. A sequence of evenly spaced floating point values from 0.0f to 1.0f is represented. e.g. a 2-bit UNORM represents 0.0f, 1/3, 2/3, and 1.0f.

    SINT: Signed 2's complement integer. e.g. a 3-bit SINT represents the integral values -4, -3, -2, -1, 0, 1, 2, 3.

    UINT: Unsigned integer. e.g. a 3-bit UINT represents the integral values 0, 1, 2, 3, 4, 5, 6, 7.

    FLOAT: A floating-point value in any of the representations defined by D3D11.

    SRGB: Similar to UNORM, in that for an n-bit number, all 0's means 0.0f and all 1's means 1.0f. However unlike UNORM, with SRGB the sequence of unsigned integer encodings between all 0's and all 1's represents a nonlinear progression in the floating point interpretation of the numbers, between 0.0f and 1.0f. Roughly, if this nonlinear progression, SRGB, is displayed as a sequence of colors, it would appear as a linear ramp of luminosity levels to an "average" observer, under "average" viewing conditions, on an "average" display. For complete detail, refer to the SRGB color standard, IEC 61966-2-1, at the IEC (International Electrotechnical Commission).

    Note that the terms above are also used as Format Name Modifiers(19.1.3.2), where they describe both how data is laid out in memory and what conversion to perform in the transport path (potentially including filtering) from memory to/from a Pipeline unit such as a Shader. See the Formats(19.1) section to see exactly how these names are used in the context of resource formats.

    What follows are descriptions of conversions from the various representations described above to other representations. Not every permutation is shown, but all of the conversions that appear somewhere in D3D11 are covered.

    3.2.3.2 Integer Conversion Precision

    Unless otherwise specified for specific cases, all conversions to/from integer representations to float representations described below must be done exactly. Where float arithmetic is involved, FULL IEEE-754 precision is required (1/2 ULP(3.1.2) of the infinitely precise result), which is stricter than the general D3D11 Floating Point Rules(3.1).

    3.2.3.3 SNORM -> FLOAT

    Given an n-bit integer value representing the signed range [-1.0f to 1.0f], conversion to floating-point is as follows:
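
    A sketch of this rule in C, assuming the scale factor 2^(n-1) - 1 and the clamping of both minimum encodings to -1.0f described in the terminology above:

        // v holds an n-bit 2's complement value, already sign-extended.
        float SnormToFloat(int v, unsigned n)
        {
            float scale = (float)((1 << (n - 1)) - 1); // e.g. 127 for n = 8
            float f = (float)v / scale;
            return f < -1.0f ? -1.0f : f; // both minimum encodings -> -1.0f
        }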

    3.2.3.4 FLOAT -> SNORM

    Given a floating-point number, conversion to an n-bit integer value representing the signed range [-1.0f to 1.0f] is as follows:

    This conversion is permitted tolerance of 0.6f ULP(3.1.2) (on the integer side). This means that after converting from float to integer scale, any value within 0.6f ULP(3.1.2) of a representable target format value is permitted to map to that value. The additional Data Invertability(19.1.2) requirement ensures that the conversion is nondecreasing across the range and all output values are attainable.

    Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.
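
    A sketch of the reference FLOAT -> SNORM conversion described above (clamp, scale by 2^(n-1) - 1, round to nearest even; NaN converting to 0 is assumed here per the general D3D conversion rules):

        #include <math.h>

        int FloatToSnorm(float f, unsigned n)
        {
            float scale = (float)((1 << (n - 1)) - 1);
            if (isnan(f)) f = 0.0f;           // NaN converts to 0
            f = fmaxf(-1.0f, fminf(f, 1.0f)); // clamp to [-1.0f, 1.0f]
            // scale, then round to nearest even (the default rounding mode)
            return (int)lrintf(f * scale);
        }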

    3.2.3.5 UNORM -> FLOAT

    3.2.3.6 FLOAT -> UNORM

    This conversion is permitted tolerance of 0.6f ULP(3.1.2) (on the integer side). This means that after converting from float to integer scale, any value within 0.6f ULP(3.1.2) of a representable target format value is permitted to map to that value. The additional Data Invertability(19.1.2) requirement ensures that the conversion is nondecreasing across the range and all output values are attainable.

    Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.
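
    A sketch of the UNORM conversions in both directions (sections 3.2.3.5 and 3.2.3.6), assuming the usual scale factor of 2^n - 1 (e.g. 255 for an 8-bit UNORM) and the same NaN-to-0 rule as SNORM:

        #include <math.h>

        float UnormToFloat(unsigned v, unsigned n)
        {
            return (float)v / (float)((1u << n) - 1);
        }

        unsigned FloatToUnorm(float f, unsigned n)
        {
            float scale = (float)((1u << n) - 1);
            if (isnan(f)) f = 0.0f;             // NaN converts to 0
            f = fmaxf(0.0f, fminf(f, 1.0f));    // clamp to [0.0f, 1.0f]
            return (unsigned)lrintf(f * scale); // round to nearest even
        }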

    3.2.3.7 SRGB -> FLOAT

    The following is the ideal SRGB to FLOAT conversion.

    This conversion will be permitted a tolerance of 0.5f ULP(3.1.2) (on the SRGB side). The procedure for measuring this tolerance, given that it is relative to the SRGB side even though the result is a FLOAT, is to convert the result back into SRGB space using the ideal FLOAT -> SRGB conversion specified below, but WITHOUT the rounding to integer, and taking the floating point difference versus the original SRGB value to yield the error. There are a couple of exceptions to this tolerance, where exact conversion is required: 0.0f and 1.0f (the ends) must be exactly achievable.

    3.2.3.8 FLOAT -> SRGB

    The following is the ideal FLOAT -> SRGB conversion.

    Assuming the target SRGB color component has n bits:

    This conversion is permitted tolerance of 0.6f ULP(3.1.2) (on the integer side). This means that after converting from float to integer scale, any value within 0.6f ULP(3.1.2) of a representable target format value is permitted to map to that value. The additional Data Invertability(19.1.2) requirement ensures that the conversion is nondecreasing across the range and all output values are attainable.

    Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.
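
    A sketch of both ideal conversions (sections 3.2.3.7 and 3.2.3.8), assuming the piecewise definition from IEC 61966-2-1; here e is the SRGB encoding already scaled to [0.0f, 1.0f] (i.e. the n-bit value divided by 2^n - 1), and the FLOAT -> SRGB result has yet to be scaled by 2^n - 1 and rounded:

        #include <math.h>

        float SrgbToFloat(float e)
        {
            return (e <= 0.04045f) ? e / 12.92f
                                   : powf((e + 0.055f) / 1.055f, 2.4f);
        }

        float FloatToSrgb(float f)
        {
            f = fmaxf(0.0f, fminf(f, 1.0f)); // clamp (NaN handling omitted)
            return (f <= 0.0031308f) ? 12.92f * f
                                     : 1.055f * powf(f, 1.0f / 2.4f) - 0.055f;
        }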

    3.2.3.9 SINT -> SINT (With More Bits)

    To convert from SINT to an SINT with more bits, the MSB bit of the starting number is "sign-extended" to the additional bits available in the target format.
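
    For example, sign-extending a 4-bit SINT into an 8-bit SINT:

        #include <stdint.h>

        int8_t Sint4ToSint8(uint8_t v4) // v4 holds a 4-bit 2's complement value
        {
            return (int8_t)((v4 & 0x8) ? (v4 | 0xF0) : v4); // replicate the MSB
        }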

    3.2.3.10 UINT -> SINT (With More Bits)

    To convert from UINT to an SINT with more bits, the number is copied to the target format's LSBs and additional MSB's are padded with 0.

    3.2.3.11 SINT -> UINT (With More Bits)

    To convert from SINT to UINT with more bits: If negative, the value is clamped to 0. Otherwise the number is copied to the target format's LSBs and additional MSB's are padded with 0.

    3.2.3.12 UINT -> UINT (With More Bits)

    To convert from UINT to UINT with more bits the number is copied to the target format's LSBs and additional MSB's are padded with 0.

    3.2.3.13 SINT or UINT -> SINT or UINT (With Fewer or Equal Bits)

    To convert from a SINT or UINT to SINT or UINT with fewer or equal bits (and/or change in signedness), the starting value is simply clamped to the range of the target format.

    3.2.4 Fixed Point Integers

    Fixed point integers are simply integers of some bit size that have an implicit decimal point at a fixed location. The ubiquitous "integer" data type is a special case of a fixed point integer with the decimal at the end of the number. Fixed point number representations are characterized as: i.f, where i is the number of integer bits and f is the number of fractional bits. e.g. 16.8 means 16 bits integer followed by 8 bits of fraction. The integer part is stored in 2's complement, at least as defined here (though it can be defined equally for unsigned integers as well). The fractional part is stored in unsigned form. The fractional part always represents the positive fraction between the two nearest integral values, starting from the most negative. Exact details of fixed point representation, and mechanics of conversion from floating point numbers are provided below.

    Addition and subtraction operations on fixed point numbers are performed simply using standard integer arithmetic, without any consideration for where the implied decimal lies. Adding 1 to a 16.8 fixed point number just means adding 256, since the decimal is 8 places in from the least significant end of the number. Other operations such as multiplication, can be performed as well simply using integer arithmetic, provided the effect on the fixed decimal is accounted for. For example, multiplying two 16.8 integers using an integer multiply produces a 32.16 result.
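
    A small worked example of 16.8 arithmetic:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            int32_t a = 3 * 256 + 64;    // 3.25 in 16.8 form (0.25 = 64/256)
            int32_t b = a + 256;         // adding 1 means adding 2^8: 4.25
            int64_t p = (int64_t)a * b;  // integer multiply: a 32.16 result
            printf("%f\n", p / 65536.0); // prints 13.812500 (= 3.25 * 4.25)
            return 0;
        }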

    Fixed point integer representations are used in a couple of places in D3D11:

    3.2.4.1 FLOAT -> Fixed Point Integer

    The following is the general procedure for converting a floating point number n to a fixed point integer i.f, where i is the number of (signed) integer bits and f is the number of fractional bits:

    Note: Sign of zero is preserved.

    For D3D11, implementations are permitted 0.6f ULP(3.1.2) tolerance in the integer result vs. the infinitely precise value n*2^f after the last step above.
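
    A sketch of this conversion (clamp to the representable range of i.f, scale by 2^f, round to nearest even), against which the 0.6f ULP tolerance is measured:

        #include <math.h>
        #include <stdint.h>

        int32_t FloatToFixed(float v, unsigned i, unsigned f)
        {
            float lo = -exp2f((float)(i - 1));                    // most negative i.f
            float hi =  exp2f((float)(i - 1)) - exp2f(-(float)f); // most positive i.f
            v = fmaxf(lo, fminf(v, hi));
            return (int32_t)lrintf(v * exp2f((float)f));          // v * 2^f, nearest even
        }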

    The diagram below depicts the ideal/reference float to fixed conversion (including round-to-nearest-even), yielding 1/2 ULP accuracy to an infinitely precise result, which is more accurate than required by the tolerance defined above. Future D3D versions will require exact conversion like this reference.

    Specific choices of bit allocations for fixed point integers are listed in the places in the D3D11 spec where they are used.

    3.2.4.2 Fixed Point Integer -> FLOAT

    Assume that the specific fixed point representation being converted to float does not contain more than a total of 24 bits of information, no more than 23 bits of which is in the fractional component. Suppose a given fixed point number, fxp, is in i.f form (i bits integer, f bits fraction). The conversion to float is akin to the following pseudocode:

    float result = (float)(fxp >> f) +                       // extract integer
                   (float)(fxp & ((1 << f) - 1)) / (1 << f); // extract fraction
    

    Although the situation rarely, if ever, arises, consider a number that originates as fixed point, gets converted to float32, and then gets converted back to fixed point: it will remain identical to its original value. This holds provided that the bit representation for the fixed point number does not contain more information than can be represented in a float32. This lossless conversion property does not hold when making the opposite round-trip, starting from float32, moving to fixed-point, and back; indeed lossy conversion is in fact the "point" of converting from float32 to fixed-point in the first place.

    One final note on round-trip conversion. Observe that when the float32 number -2.75 is converted to fixed-point, it becomes -3 +0.25, that is, the integer part is negative but the fixed point part, considered by itself, is positive. When that is converted back to float32, it becomes -2.75 again, since floating point stores negative numbers in sign-magnitude form, instead of in two's complement form.
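
    The -2.75 example above, worked through in C (assuming arithmetic right shift of negative integers):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            int32_t fxp = -704;          // -2.75 in 16.8 form (-2.75 * 256)
            int32_t ipart = fxp >> 8;    // arithmetic shift floors: -3
            int32_t fpart = fxp & 0xFF;  // positive fraction: 64, i.e. 0.25
            float back = (float)ipart + fpart / 256.0f;
            printf("%f\n", back);        // prints -2.750000
            return 0;
        }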


    3.3 Coordinate Systems


    Section Contents

    (back to chapter)

    3.3.1 Pixel Coordinate System
    3.3.2 Texel Coordinate System
    3.3.3 Texture Coordinate Interpretation


    3.3.1 Pixel Coordinate System

    The Pixel Coordinate System defines the origin as the upper-left corner of the RenderTarget. Pixel centers are therefore offset by (0.5f,0.5f) from integer locations on the RenderTarget. This choice of origin makes rendering screen-aligned textures trivial, as the pixel coordinate system is aligned with the texel coordinate system.

    D3D9 and prior had a terrible Pixel Coordinate System where the origin was the center of the top left pixel on the RenderTarget. In other words, the origin was (0.5,0.5) away from the upper left corner of the RenderTarget. There was the nice property that Pixel centers were at integer locations, but the fact that this was misaligned with the texture coordinate system frequently burned unsuspecting developers. Further, with Multisample rendering, there was a 1/2 pixel wide region of the RenderTarget along the top and left edge that the viewport could not cover. D3D11 allows applications that want to emulate this behavior to specify a fractional offset to the top left corner of the viewport (-0.5,-0.5).

    3.3.2 Texel Coordinate System

    The texel coordinate system has its origin at the top-left corner of the texture. See the "Texel Coordinate System" diagram below. This is consistent with the Pixel Coordinate System.

    3.3.3 Texture Coordinate Interpretation

    Memory load instructions such as sample(22.4.15) and ld(22.4.6) interpret texture coordinates in different ways (normalized float and scaled integer, respectively). The "Texture Coordinate Interpretation" diagram below describes how these interpretations get mapped to specific texel(s), for point and linear sampling. The diagram does not illustrate address wrapping, which occurs after the shown equations for computing texel locations. The addressing math shown in this diagram is only a general guideline, and the exact definition of texel selection arithmetic is provided in the Texture Sampling(7.18) section, including the role of Fixed Point(3.2.4.1) precision snapping in the addressing process.


    3.4 Rasterization Rules


    Section Contents

    (back to chapter)

    3.4.1 Coordinate Snapping
    3.4.2 Triangle Rasterization Rules

    3.4.2.1 Top-Left Rule
    3.4.3 Aliased Line Rasterization Rules
    3.4.3.1 Interaction With Clipping
    3.4.4 Alpha Antialiased Line Rasterization Rules
    3.4.5 Quadrilateral Line Rasterization Rules
    3.4.6 Point Rasterization Rules


    3.4.1 Coordinate Snapping

    Consider a set of vertices going through the Rasterizer, after having gone through clipping, perspective divide and viewport scale. Suppose that any further primitive expansion has been done (e.g. rectangular lines can be drawn by implementations as 2 triangles, described later). After the final primitives to be rasterized have been obtained, the x and y positions of the vertices are snapped to exactly n.8 fixed point integers. Any front/back culling is applied (if applicable) after vertices have been snapped. Interpolation of pixel attributes is set up based on the snapped vertex positions of primitives being rasterized.

    3.4.2 Triangle Rasterization Rules

    Any pixel sample locations which fall inside the triangle are drawn. An example with a single sample per pixel (at the center) is shown below. If a sample location falls exactly on the edge of the triangle, the Top-Left Rule applies, to ensure that adjacent triangles do not overdraw. The Top-Left rule is described below.

    3.4.2.1 Top-Left Rule

    Top edge: If an edge is exactly horizontal, and it is above the other edges of the triangle in pixel space, then it is a "top" edge.

    Left edge: If an edge is not exactly horizontal, and it is on the left side of the triangle in pixel space, then it is a "left" edge. A triangle can have one or two left edges.

    Top-Left Rule: If a sample location falls exactly on the edge of a triangle, the sample is inside the triangle if the edge is a "top" edge or a "left" edge. If two edges from the same triangle touch the pixel center, then if both edges are "top" or "left" then the sample is inside the triangle.
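
    A non-normative sketch of one common formulation of this test, for a clockwise-wound triangle in a y-down pixel space (the names and setup are illustrative). A sample is covered by the triangle when this test passes for all three directed edges:

        #include <stdint.h>

        typedef int64_t Fixed; // coordinates snapped to fixed point, e.g. n.8

        // Edge function: positive when sample (px,py) lies on the interior
        // side of the directed edge (ax,ay)->(bx,by).
        Fixed Edge(Fixed ax, Fixed ay, Fixed bx, Fixed by, Fixed px, Fixed py)
        {
            return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
        }

        int SampleInsideEdge(Fixed ax, Fixed ay, Fixed bx, Fixed by,
                             Fixed px, Fixed py)
        {
            Fixed e = Edge(ax, ay, bx, by, px, py);
            int isTop  = (ay == by) && (bx > ax); // horizontal, interior below
            int isLeft = (by < ay);               // edge heads up the screen
            // on-edge samples count only for "top" or "left" edges
            return (e > 0) || (e == 0 && (isTop || isLeft));
        }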

    3.4.3 Aliased Line Rasterization Rules

    Rasterization rules for infinitely-thin lines, with no antialiasing, are described below.

    3.4.3.1 Interaction With Clipping

    One further implication of these line rasterization rules is that lines that are geometrically clipped to the viewport extent may set one fewer pixel than lines that are rendered to a larger 2D extent with the pixels outside the viewport discarded. (This is due to the handling of the line endpoints.)

    Since geometric clip to the viewport is neither required nor disallowed, aliased line rendering is allowed to differ in viewport-edge pixels due to geometric clipping.

    3.4.4 Alpha Antialiased Line Rasterization Rules

    The alpha-based antialiased rasterization of a line (defined by two end vertices) is implemented as the visualization of a rectangle, with the line's two vertices centered on two opposite "ends" of the rectangle, and the other two edges separated by a width (in D3D11 the only available width is 1.0f). No accounting for connected line segments is done. The region of intersection of this rectangle with the RenderTarget is estimated by some algorithm, producing "Coverage" values [0.0f..1.0f] for each pixel in a region around the line. The Coverage values are multiplied into the Pixel Shader output o0.a value before the Output Merger Stage. Undefined results are produced if the PS does not output o0.a. D3D11 exposes no controls for this line mode.

    It is deemed that there is no single "best" way to perform alpha-based antialiased line rendering. D3D11 adopts as a guideline the method shown in the diagram below. This method was derived empirically, exhibiting a number of visual properties deemed desirable. Hardware need not exactly match this algorithm; tests against this reference shall have "reasonable" tolerances, guided by some of the principles listed further below, permitting various hardware implementations and filter kernel sizes. None of this flexibility permitted in hardware implementation, however, can be communicated up through D3D11 to applications, beyond simply drawing lines and observing/measuring how they look.

    The following is a listing of the "nice" properties that fall out of the above algorithm, which in general will be expected of hardware implementations (admittedly many of which are likely difficult to test):

    Note that the wider the filter kernel an implementation uses, the blurrier the line, and thus the more sensitive the resulting perceived line intensity is to display gamma. The reference implementation's kernel is quite large, at 3x3 pixel units about each pixel.

    3.4.5 Quadrilateral Line Rasterization Rules

    Quadrilateral lines take 2 endpoints and turn them into a simple rectangle with width 1.4f, drawn with triangles. The attributes at each end of the line are duplicated for the 2 vertices at each end of the rectangle.

    This mode is not supported with center sample patterns (D3D11_CENTER_MULTISAMPLE_PATTERN) where there is more than one sample overlapping the center of the pixel, in which case results of drawing this style of line are undefined. See here(19.2.4.1).

    The width of 1.4f is an arbitrarily aesthetic choice, used in previous versions of D3D. With no good reason to change, it was left the same.

    3.4.6 Point Rasterization Rules

    For the purpose of rasterization, a point is represented as a square of width 1 oriented to the RenderTarget. Actual implementations may vary, but output behavior should be identical to what is described here. The coordinate for a point identifies where the center of the square is located. Pixel coverage for points follows Triangle Rasterization Rules, interpreted as though a point is composed of 2 triangles in a Z pattern, with attributes duplicated at the 4 vertices. Cull modes do not apply to points.


    3.5 Multisampling


    Section Contents

    (back to chapter)

    3.5.1 Overview
    3.5.2 Warning about the MultisampleEnable State
    3.5.3 Multisample Sample Locations And Reconstruction
    3.5.4 Effects of Sample Count > 1

    3.5.4.1 Sample-Frequency Execution and Rasterization
    3.5.4.1.1 Invariance Property
    3.5.5 Centroid Sampling of Attributes
    3.5.6 Target Independent Rasterization
    3.5.6.1 Forcing Rasterizer Sample Count
    3.5.6.2 Rasterizer Behavior with Forced Rasterizer Sample Count
    3.5.6.3 Support on Feature Levels 10_0, 10_1, 11_0
    3.5.6.4 UAV-Only Rasterization with Multisampling
    3.5.7 Pixel Shader Derivatives


    3.5.1 Overview

    Multisample Antialiasing seeks to fight geometry aliasing, without necessarily dealing with surface aliasing (leaving that as a shading problem, e.g. texture filtering). This is accomplished by performing pixel coverage tests and depth/stencil tests at multiple sample locations per pixel, backed by storage for each sample, while only performing pixel shading calculations once for covered pixels (broadcasting Pixel Shader output across covered samples). It is also possible to request Pixel Shader invocations to occur at sample-frequency rather than at pixel-frequency.

    3.5.2 Warning about the MultisampleEnable State

    The MultisampleEnable Rasterizer State remains as an awkward leftover from D3D9. It no longer does what the name implies; it no longer has any bearing on multisampling; it only controls line rendering behavior now. The state should have been renamed/refactored, but the opportunity was missed in D3D11. For a detailed discussion about what this state actually does now, see State Interaction With Point/Line/Triangle Rasterization Behavior(15.14).

    3.5.3 Multisample Sample Locations And Reconstruction

    Specifics about sample locations and reconstruction functions for multisample antialiasing are dependent on the chosen Multisample mode, which is outside the scope of this section. See Multisample Format Support(19.2), and Specification of Sample Positions(19.2.4).

    3.5.4 Effects of Sample Count > 1

    Rasterization behavior when sample count is greater than 1 is simply that primitive coverage tests are done for each sample location within a pixel. If one or more sample locations in a pixel are covered, the Pixel Shader is run once for the pixel in Pixel-Frequency mode, or in Sample-Frequency mode once for each covered sample that is also in the Rasterizer SampleMask. Pixel-frequency execution produces a single set of Pixel Shader output data that is replicated to all covered samples that pass their individual depth/stencil tests and blended to the RenderTarget per-sample. Sample-frequency execution produces a unique set of Pixel Shader output data per covered sample (and in SampleMask), each output getting blended 1:1 to the corresponding RenderTarget sample if its depth/stencil test passes.

    3.5.4.1 Sample-Frequency Execution and Rasterization

    Note that points(3.4.6) and quadrilateral lines(3.4.5) are functionally equivalent to drawing their area with triangles. So Sample-Frequency execution is easily defined for all of these primitives. For points, the samples covered by the point area (and in the RasterizerState's SampleMask) each get Pixel Shader invocations with attributes replicated from the single vertex (except for one varying parameter - an ID identifying each sample from the total set of samples in the pixel). For quadrilateral lines, the two end vertices define how attributes interpolate along the length, staying constant across the perpendicular. Again, the samples covered by the area of the primitive (and in the SampleMask) each get a Pixel Shader invocation in Sample-Frequency execution mode, with unique input attributes per sample, including an ID identifying which sample it is.

    Alpha-Antialiased Lines(3.4.4) and Aliased Lines(3.4.3) are algorithms that inherently do not deal with discrete sample locations within a pixel's area, and thus it is illegal/undefined to request Sample-Frequency execution for these primitives, unless the sample count is 1, which is identical to Pixel-Frequency execution.

    3.5.4.1.1 Invariance Property

    Consider a Pixel Shader that operates only on pixel-frequency inputs (e.g. all attributes have one of the following interpolation modes(16.4): constant, linear, linear_centroid, linear_noperspective or linear_noperspective_centroid). Implementations need only execute the shader once per pixel and replicate the results to all samples in the pixel. Now suppose code is added to the shader that generates new outputs based on reading sample-frequency inputs. The existing pixel-frequency part of the shader behaves identically to before. Even though the shader will now execute at sample-frequency (so the new outputs can vary per-sample), each invocation produces the same result for the original outputs.

    Though this example happens to separate out the different interpolation frequencies to highlight their invariance, of course it is perfectly valid in general for shader code to mix together inputs with any different interpolation modes.

    3.5.5 Centroid Sampling of Attributes

    When a sample-frequency interpolation mode(16.4) is not needed on an attribute, pixel-frequency interpolation modes such as linear evaluate at the pixel center. However with sample count > 1 on the RenderTarget, attributes could be interpolated at the pixel center even though the center of the pixel may not be covered by the primitive, in which case interpolation becomes "extrapolation". This "extrapolation" can be undesirable in some cases, so short of going to sample-frequency interpolation, a compromise is the centroid interpolation mode.

    Centroid behaves exactly as follows:


    3.5.6 Target Independent Rasterization

    The term Conservative Rasterization has been used to describe what is essentially a GPU rasterizer assist for shader-computed antialiasing. At the time this section was written, the concept had not actually been implemented in any known GPUs, but the following short discussion of Conservative Rasterization somewhat motivates the alternative that is specified here - Target Independent Rasterization. Note that as of D3D11.3, hardware has evolved to support Conservative Rasterization(15.17).

    Consider how multisampling works in D3D (or GPU rasterization in general). Each pixel has “sample” positions which cause Pixel Shaders to be invoked when primitives (e.g. triangles) cover the samples. For multisampling, a single Pixel Shader invocation occurs when at least one sample in a pixel is covered. Alternatively, D3D10.1+ also allows the shader to request that the Pixel Shader be invoked for each covered sample – this has historically been called “supersampling”.

    The downside to these antialiasing approaches is they are based on a discrete number of samples. The more samples the better, but there are still holes in the pixel area between the sample points in which geometry rendered there does not contribute to the image.

    Conservative Rasterization, instead, would ideally invoke the Pixel Shader if the area of a primitive (e.g. triangle) being rendered has any chance of intersecting with the pixel’s square area. It would then be up to shader code to compute whatever measure of pixel area intersection it desires. It may be acceptable for the rasterization to be “conservative” in that triangles/primitives are simply rasterized with a fattened screen space area that could include some pixels with no actual coverage – it doesn’t really matter since the shader will be computing the actual coverage.

    The win is that the number of Pixel Shader invocations is reasonably bounded to the triangle extents (as opposed to rendering bounding rectangles), and the output can be “perfect” antialiasing if desired. This is particularly the case if also utilizing some other features in D3D11 that allow arbitrary length lists to be recorded per pixel.

    However, the complexity of the shader code required to compute an analytic coverage solution with Conservative Rasterization might be too high for the benefit. An alternative scheme, Target Independent Rasterization, is defined here under the more mundane heading 'Forcing Rasterizer Sample Count' below. First though, some discussion of how Target Independent Rasterization can help in at least one scenario - path rendering in Direct2D.

    A common usage scenario of Direct2D is to stroke and/or fill anti-aliased paths. The semantics of the Direct2D anti-aliasing scheme are different from MSAA. The key difference is when the resolve step occurs. With MSAA the resolve step typically happens once per frame. With Direct2D anti-aliasing the resolve step occurs after each path is rendered. To work around these semantic differences the Windows 7 version of Direct2D performs rasterization on the CPU. When a path is to be filled or stroked, an expensive CPU-based algorithm computes the percentage of each pixel that is covered by the path. The GPU is used to multiply the path color by the coverage and blend the results with the existing render target contents. This approach is heavily CPU-bound.

    Target Independent Rasterization enables Direct2D to move the rasterization step from the CPU to the GPU while still preserving the Direct2D anti-aliasing semantics. Rendering of anti-aliased paths will be performed in 2 passes on the GPU. The first pass will write per-pixel coverage to an intermediate render target texture. Paths will be tessellated into non-overlapping triangles. The GPU will be programmed to use Target Independent Rasterization and additive blending during the first pass. The pixel shader used in the first pass will simply count the number of bits set in the coverage mask and output the result normalized to [0.0,1.0]. During the second pass the GPU will read from the intermediate texture and write to the application’s render target. This pass will multiply the path color by the coverage computed during the first pass.

    In some cases, it will be faster for Direct2D to tessellate paths into potentially overlapping triangles. In these cases, the 1st pass will set the ForcedSampleCount to 16 and simply output the coverage mask to the intermediate (R16_UINT). The blender would be setup to do a bitwise OR, or XOR operation (depending on the scenario). The second pass would read this 16-bit value from the intermediate, count the number of bits set, and modulate the color being written to the render target.
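
    A sketch of the bit counting used in these passes (in HLSL this would be countbits(mask) / 16.0f; shown here in C for illustration):

        // Count the set bits of a 16-sample coverage mask and normalize
        // the result to [0.0f, 1.0f].
        float CoverageFromMask(unsigned mask)
        {
            unsigned bits = 0;
            for (unsigned m = mask & 0xFFFFu; m != 0; m &= m - 1)
                ++bits; // clears the lowest set bit each iteration
            return bits / 16.0f;
        }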

    There are 2 fallbacks that could be used to implement this algorithm on GPUs that do not support Target Independent Rasterization. The first fallback would render the scene N times, with alpha = 1/N and additive blending for the first step of the algorithm. This would produce the same results, but at the cost of resorting to multipass rendering to mimic the effect of supersampling at the rasterizer. The second fallback would use MSAA to implement the first pass of the algorithm. Both fallbacks are bound by memory bandwidth (render target writes). Using Target Independent Rasterization would significantly reduce the memory bandwidth requirements of this algorithm.

    3.5.6.1 Forcing Rasterizer Sample Count

    Overriding the Rasterizer sample count means defining the multisample pattern at the Rasterizer independent of what RenderTargetViews(5.2) (or UnorderedAccessView(5.3.9)s) may be bound at the Output Merger (and their associated sample count / Quality Level).

    The ForcedSampleCount state setting is located in the Rasterizer State(15.1) object.

        UINT ForcedSampleCount; // Valid values for Target Independent Rasterization (TIR): 0, 1, 4, 8, 16
                                // Valid values for UAV(5.3.9) only render: 0, 1, 4, 8, 16
                                // 0 means don't force sample count.
    

    Devices must support all the standard sample patterns up to and including 16 for the ForcedSampleCount. This is required even if the device does not support that many samples in RenderTarget / DepthStencil resources.

    Investigations show that the 16 sample standard D3D pattern compares favorably with Direct2D's original software-based rasterization pattern, which, despite using 64 samples, had the significant disadvantage of a regular grid layout.

    3.5.6.2 Rasterizer Behavior with Forced Rasterizer Sample Count

    With a forced sample count/pattern selected at the rasterizer (ForcedSampleCount > 0), pixels are candidates for shader invocation based on the selected sample pattern, independent of the RTV ("output") sample count. The burden is then on shader code to make sense of the possible mismatch between rasterizer and output storage sample count, given the defined semantics.

    Here are the behaviors with ForcedSampleCount > 0.

    The above functionality is required for Feature Level 11_1 hardware.

    3.5.6.3 Support on Feature Levels 10_0, 10_1, 11_0

    D3D10.0 - D3D11.0 hardware (and Feature Level 10_0 - 11_0) supports ForcedSampleCount set to 1 (and any sample count for RTV) along with the described limitations (e.g. no depth/stencil).

    For 10_0, 10_1, and 11_0 hardware, when ForcedSampleCount is set to 1, line rendering cannot be configured to 2-triangle (quadrilateral) based mode (i.e. the MultisampleEnable state cannot be set to true). This limitation isn't present for 11_1 hardware. Note the naming of the 'MultisampleEnable' state is misleading since it no longer has anything to do with enabling multisampling; instead it is now one of the controls along with AntialiasedLineEnable for selecting line rendering mode.

    This limited form of Target Independent Rasterization, ForcedSampleCount = 1, closely matches a mode that was present in D3D10.0 but due to API changes became unavailable for D3D10.1 and D3D11 (and Feature Levels 10_1 and 11_0). In D3D10.0 this mode was center-sampled rendering, even on an MSAA surface, available when MultisampleEnable was set to false (and this could be toggled by toggling MultisampleEnable). In D3D10.1+, MultisampleEnable no longer affects multisampling (despite the name) and only controls line rendering behavior. It turns out some software, such as Direct2D, depended on this mode to be able to render correctly on MSAA surfaces. As of D3D11.1, D2D can use ForcedSampleCount = 1 to bring back this mode consistently on all D3D10+ hardware and Feature Levels. D3D10.0 also supported depth testing in this mode, but it is not worth exposing that, given that D2D did not expose it and the full D3D11.1 definition of the feature doesn't work with depth/stencil.

    3.5.6.4 UAV-Only Rasterization with Multisampling

    D3D11 allows rasterization with only UAVs bound, and no RTVs/DSVs. Since UAVs can have arbitrary (and differing) sizes, the viewport/scissor essentially identifies the pixel dimensions. Before this feature, when rendering with only UAVs bound, the rasterizer was limited to a single sample only.

    UAV(5.3.9)-only rendering with multisampling at the rasterizer is possible by keying off the ForcedSampleCount state described earlier, with the sample counts limited to 0, 1, 4, 8 and 16. (The UAVs themselves are not multisampled in terms of allocation.) A setting of 0 is equivalent to the setting 1 - single sample rasterization.

    Shaders can request pixel-frequency invocation with UAV-only rendering, but requesting sample-frequency invocation is invalid (produces undefined shading results).

    The SampleMask Rasterizer State does not affect rasterization behavior at all here.

    On D3D11.0 hardware, ForcedSampleCount can be 0, 1, 4 and 8 with UAV only Rasterization. D3D11.1 hardware additionally supports 16.

    Attempting to render with unsupported ForcedSampleCount produces undefined rendering results - though if a ForcedSampleCount is chosen that could never be valid for TIR or UAV-only rendering the runtime will fail the Rasterizer State object creation immediately.
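
    The following is a minimal API-level sketch of configuring Target Independent Rasterization; at the API, ForcedSampleCount lives in D3D11_RASTERIZER_DESC1, created through ID3D11Device1::CreateRasterizerState1. The pDevice1 pointer, the remaining state values and the particular sample count are assumptions for illustration.

        D3D11_RASTERIZER_DESC1 rsDesc = {};
        rsDesc.FillMode = D3D11_FILL_SOLID;
        rsDesc.CullMode = D3D11_CULL_BACK;
        rsDesc.DepthClipEnable = TRUE;
        rsDesc.ForcedSampleCount = 4;  // 0, 1, 4, 8 for UAV-only on 11.0 hardware; 16 added on 11.1
        ID3D11RasterizerState1* pRS = nullptr;
        HRESULT hr = pDevice1->CreateRasterizerState1(&rsDesc, &pRS); // pDevice1: assumed ID3D11Device1*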

    3.5.7 Pixel Shader Derivatives

    Pixel Shaders always run in minimum 2x2 quanta to be able to support derivative calculations, regardless of the RenderTarget sample count. These Pixel Shader derivatives, used in texture filtering operations but also available directly to shaders, are calculated by taking deltas of data in adjacent pixels. This requires that the data in each pixel has been sampled with unit spacing horizontally or vertically.

    RenderTarget sample counts > 1 do not affect derivative calculation methods. If derivatives are requested on an attribute that has been Centroid sampled, the hardware calculation is not adjusted, and therefore incorrect derivatives will often result. What the Shader expects to be a derivative with respect to a unit distance in the x or y direction in RenderTarget space will actually be the rate of change with respect to some other direction vector, which also probably isn't unit length.

    The point here is that it is the application's responsibility to exercise caution when requesting derivatives of Centroid sampled attributes, ideally never requesting them at all. Centroid sampling can be useful for situations where it is critical that a primitive's interpolated attributes are not "extrapolated", but this comes with some tradeoffs: First, centroid sampled attributes may appear to jump around as a primitive edge moves over a pixel, rather than changing continuously. Secondly, derivative calculations on the attributes become unreliable or difficult to use correctly (which also hurts texture sampling operations that derive LOD from derivatives).

    Under sample-frequency execution, a 2x2 quad of Pixel Shaders executes for each sample index where that sample is covered in at least one of the pixels participating in the 2x2 quad. This allows derivatives to be calculated in the usual way since any given sample is located one unit apart horizontally or vertically from the corresponding sample in the neighboring pixels.

    It is left to the application's shader author to decide how to adjust for the fact that derivatives calculated from spacings of one unit may need to be scaled in some way to reflect higher frequency shader execution, depending on the sample pattern/count.
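
    To make the quad mechanics concrete, here is a small illustrative model (not hardware-mandated code) of coarse derivative evaluation over a 2x2 quad; the particular value layout (top-left, top-right, bottom-left, bottom-right) is an assumption of this sketch.

        // v holds one shader value per pixel of the quad:
        // v[0]=top-left, v[1]=top-right, v[2]=bottom-left, v[3]=bottom-right
        float DerivX(const float v[4]) { return v[1] - v[0]; } // delta over one unit in x
        float DerivY(const float v[4]) { return v[2] - v[0]; } // delta over one unit in y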

    Further important discussion of Pixel Shader derivatives is under Interaction of Varying Flow Control With Screen Derivatives(16.8).


    4 Rendering Pipeline


    Chapter Contents

    (back to top)

    4.1 Minimal Pipeline Configurations
    4.2 Fixed Order of Pipeline Results
    4.3 Shader Programs
    4.4 The Element


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    The rendering Pipeline encapsulates all state related to the rendering of a primitive. This includes a sequence of pipeline stages as well as various state objects.


    4.1 Minimal Pipeline Configurations


    Section Contents

    (back to chapter)

    4.1.1 Overview
    4.1.2 No Buffers at Input Assembler
    4.1.3 IA + VS (+optionally GS) + No PS + Writes to Depth/Stencil Enabled
    4.1.4 IA + VS (+optionally GS) + PS (incl. Rasterizer, Output Merger)
    4.1.5 IA + VS + SO
    4.1.6 No RenderTarget(s) and/or Depth/Stencil and/or Stream Output
    4.1.7 IA + VS + HS + Tessellation + DS + ...
    4.1.8 Compute alone
    4.1.9 Minimal Shaders


    4.1.1 Overview

    Not all Pipeline Stages must be active. This section clarifies this concept by illustrating some minimal configurations that can produce useful results. The Graphics pipeline is accessed by Draw* calls from the API. The alternative pipeline, Compute, is accessed by issuing Dispatch* calls from the API.

    For the Graphics pipeline, the Input Assembler is always active, as it produces pipeline work items. In addition, the Vertex Shader is always active. Relying on the presence of the Vertex Shader at all times simplifies data flow permutations very significantly, versus allowing the Input Assembler with its limited programming flexibility to feed any pipeline stage.

    Note that even though the Vertex Shader must always be active in the Graphics pipeline, applications that really don't want a Vertex Shader can simply implement it as a trivial (or nearly trivial) sequence of mov's from inputs to outputs. The short length and simplicity of such "passthrough" shaders should make it practical for hardware implementations to hide their cost, one way or another.

    4.1.2 No Buffers at Input Assembler

    A minimal use of the Input Assembler is to not have any input Buffers bound (vertex or index data). The Input Assembler can generate counters such as VertexID(8.16), InstanceID(8.18) and PrimitiveID(8.17), which can identify vertices/primitives generated in the pipeline by Draw*(), or DrawIndexed*() (if at least an Index Buffer is bound). Thus Shaders can minimally drive all their processing based on the IDs if desired, including fetching appropriate data from Buffers or Textures.
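
    As a minimal API-level sketch (pContext is assumed, and a Vertex Shader keying off VertexID is assumed to be bound already), a bufferless draw looks like:

        pContext->IASetInputLayout(nullptr); // no input elements at all
        pContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
        pContext->Draw(3, 0);                // Shader sees VertexID 0..2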

    4.1.3 IA + VS (+optionally GS) + No PS + Writes to Depth/Stencil Enabled

    If the shader stage before the rasterizer outputs position, and Depth/Stencil writes are enabled, the rasterizer will simply perform the fixed-function depth/stencil tests and updates to the Depth/Stencil buffer, even if there is no Pixel Shader active. No Pixel Shader means no updates to RenderTargets other than Depth/Stencil.

    4.1.4 IA + VS (+optionally GS) + PS (incl. Rasterizer, Output Merger)

    The Input Assembler + Vertex Shader (required) can drive the Pixel Shader directly (GS does not have to be used, but can be). If an application seeks to write data to RenderTarget(s), not including Depth/Stencil which was explained earlier, the Pixel Shader must be active. This implicitly involves the Output Merger as well, though as described further below, there's no requirement that RenderTargets need to be bound just because rasterization is occurring.

    4.1.5 IA + VS + SO

    The Input Assembler (+required VS) can feed Stream Output directly with no other stages active. Note that as described in the Stream Output Stage(14) section, Stream Output is tied to the Geometry Shader, however a "NULL" Geometry Shader can be specified, allowing the outputs of the Vertex Shader to be sent to Stream Output with no other stages active.
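
    At the API, such a "NULL" Geometry Shader for Stream Output is created by handing the Vertex Shader's bytecode to CreateGeometryShaderWithStreamOutput. The following is a minimal sketch; pDevice, pVSBlob and the single-element output signature are assumptions.

        D3D11_SO_DECLARATION_ENTRY soDecl[] =
        {
            { 0, "SV_Position", 0, 0, 4, 0 }, // stream 0, components xyzw, output slot 0
        };
        UINT stride = 4 * sizeof(float);      // bytes per vertex in the SO buffer
        ID3D11GeometryShader* pSOShader = nullptr;
        HRESULT hr = pDevice->CreateGeometryShaderWithStreamOutput(
            pVSBlob->GetBufferPointer(), pVSBlob->GetBufferSize(),
            soDecl, 1, &stride, 1,
            D3D11_SO_NO_RASTERIZED_STREAM, // nothing sent to the rasterizer
            nullptr, &pSOShader);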

    4.1.6 No RenderTarget(s) and/or Depth/Stencil and/or Stream Output

    Whether or not the Pixel Shader is active, it is always legal to NOT have any output targets bound (and/or have output masks defined so that no output targets are written). Likewise for Stream Output. This might be interesting for performance tests which don't include output memory bandwidth (and which might examine feedback statistics such as shader invocation counts, which is itself a form of pipeline output anyway).


    4.1.7 IA + VS + HS + Tessellation + DS + ...

    Take any of the configurations above, and HS + Tessellator + DS can be inserted after the VS. The presence of the DS is what implies the presence of the Tessellator before it.

    4.1.8 Compute alone

    When the Compute Shader runs, it runs by itself. The state for both the Graphics pipeline shaders and the Compute Shader can be simultaneously bound; the selection of which pipeline to use is made by the call: Draw* invokes Graphics and Dispatch* invokes Compute.
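
    A minimal sketch (pContext and the shader objects are assumptions):

        pContext->CSSetShader(pComputeShader, nullptr, 0); // Compute state
        pContext->Dispatch(16, 16, 1);  // runs the Compute Shader only
        pContext->Draw(3, 0);           // runs the Graphics pipeline only; CS state is untouched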

    4.1.9 Minimal Shaders

    All vertex shaders must have a minimum of one input and one output, which can be as little as one scalar value. Note that System Generated Values such as VertexID(8.16) and InstanceID(8.18) count as input.


    4.2 Fixed Order of Pipeline Results

    The rendering Pipeline is designed to allow hardware to execute tasks at various stages in parallel. However observable rendering results must match results produced by serial processing of tasks. Whenever a task in the Pipeline could be performed either serially or in parallel, the results produced by the Pipeline must match serial operation. That is, the order that tasks enter the Pipeline is the order that tasks are observed to be propagated all the way through to completion. If a task moving through the Pipeline generates additional sub-tasks, those sub-tasks are completed as part of completing the spawning task, before any subsequent tasks are completed. Note that this does not prevent hardware from executing tasks out of order or in parallel if desirable, just as long as results are buffered appropriately such that externally visible results reflect serial execution.

    One exception to this fixed ordering is with Tessellation. With the fixed function Tessellation stage, implementations are free to generate points and topology in any order as long as that order is consistent given the same input on the same device. Vertices can even be generated multiple times in the course of tessellating a patch, as long as the Tessellator output topology is not point (in which case only the unique points in the patch must be generated). This tessellator exception is discussed here(11.7.9).

    Another exception to the fixed ordering of pipeline results is any access to an Unordered Access View of a Resource (for example via the Compute Shader or Pixel Shader). These types of Views explicitly allow unordered results, leaving the burden on applications to make careful choices of atomic instructions to access Unordered Access Views if deterministic and implementation invariant output is desired.


    4.3 Shader Programs

    A Shader object encapsulates a Shader program for any type of Shader unit. All shaders have a common binary format and basically have the following typical layout. A helpful reference for this is the source code accompanying the Reference Rasterizer, which includes facilities for parsing the shader binary.

    The Tessellation related shaders have a significantly different structure, particularly the Hull Shader, which appears as multiple phases of shaders concatenated together (not depicted here).

    version
    input declarations
    output declarations
    resource declarations
    code
    
    version
        describes the Shader type: Vertex Shader(vs),
        Hull Shader (hs), Domain Shader (ds),
        Geometry Shader (gs), Pixel Shader (ps),
        Compute Shader (cs).
        Example: vs_5_0, ps_5_0
    input declarations
        declare which input registers are read
        Example:
            dcl_input v[0]
            dcl_input v[1].xy
            dcl_input v[2]
    output declarations
        declare which output registers are written
        Example:
            dcl_output o[0].xyz
            dcl_output o[1]
            dcl_output o[2].xw
    resource declarations
        Example:
            dcl_resource t0, Buffer, UNORM
            dcl_resource t2, Texture2DArray, FLOAT
    code
        This Shader section contains executable instructions.
    

    4.4 The Element


    Section Contents

    (back to chapter)

    4.4.1 Overview
    4.4.2 Elements in the Pipeline
    4.4.3 Passing Elements Through Pipeline Interfaces

    4.4.3.1 Memory-to-Stage Interface
    4.4.3.2 Stage-to-Stage Interface
    4.4.3.2.1 Varying Frequencies of Operation
    4.4.3.3 Stage-to-Memory Interface
    4.4.4 System Generated Values
    4.4.5 System Interpreted Values
    4.4.6 Element Alignment


    4.4.1 Overview

    From the perspective of individual D3D11.3 Pipeline stages accessing and interpreting memory, all memory layouts (e.g. Buffer, Texture1D/2D/3D/Cube) are viewed as being composed of "Elements". An individual Element represents a vector of anywhere from 1 to 4 values. An Element could be an R8G8B8A8 packing of data, a single 8-bit integer value, 4 float32 values, etc. In particular, an Element is any one of the DXGI_FORMAT_* formats(19.1), e.g. DXGI_FORMAT_R8G8B8A8 (DXGI stands for "DirectX Graphics Infrastructure", a software component outside the scope of this specification which happens to own the list of DirectX formats going forward). Filtering may be involved in the process of fetching an Element from a texture, and this simply involves looking at multiple values for a given Element in memory and blending them in some fashion to produce an Element that is returned to the Shader.

    Buffers in memory can be made up of structures of Elements (as opposed to being a collection of a single Element). For example a Buffer could represent an array of vertices, each vertex containing several elements, such as: position, normal and texture coordinates. See the Resources(5) section for full detail.

    4.4.2 Elements in the Pipeline

    The concept of "Elements" does not only apply to resources. Elements also characterize data passing from one Pipeline stage to the next. For example the outputs of a Vertex Shader (Elements making up a vertex) are typically read into a subsequent Pipeline stage as input data, for instance into a Geometry Shader. In this scenario, the Vertex Shader writes values to output registers, each of which represents an individual Element. The subsequent Shader (Geometry Shader in this example) would see a set of input registers each initialized with an Element out of the set of input data.

    4.4.3 Passing Elements Through Pipeline Interfaces

    There are various types of data interfaces in the hardware Pipeline through which Elements pass. This section describes the interfaces in generic terms, and characterizes how Elements of data pass through them. Specific descriptions for each of the actual interfaces in the Pipeline are provided throughout the spec, in a manner consistent with the principles outlined here. The overall theme here is that data mappings through all interfaces are always direct, without any linkage resolving required.

    4.4.3.1 Memory-to-Stage Interface

    The first type of interface is Memory-to-Stage, where an Element from a Resource (Texture/Buffer) is being fetched into some part of the Pipeline, possibly the "top" of the Pipeline (Input Assembler(8)), or the "side", meaning a fetch driven from within a Shader Stage. At the point of binding of memory Resources to these interfaces, a number is given to each Element that is bound, representing which input (v#) or texture (t#) "register" at the particular interface refers to the Element. Note that there is no linkage resolving done on behalf of the application; the Shader assumes which "registers" will refer to particular Elements in memory, and so when memory is bound to the interface, it must be bound (or declared, in cases where multiple Elements come from the same Resource in memory) at the "register" expected by the Shader.

    For Memory-to-Stage interfaces, Elements always provide to the Shader 4 components of data, with defaults provided for Elements in memory containing fewer than 4 components (though this can be masked to be any subset of the 4 components in the Shader if desired).

    For interfaces on the "side", where memory Resources are bound to Shader Stages so they can be fetched from via Shader code, the set of binding points (t# registers in the Shader) cannot be dynamically indexed within the Shader program without using flow control.

    On the other hand, the interface at the "top" of the Pipeline (the input v# registers of the first active Shader Stage) can be dynamically indexed as an array from Shader code. The Elements in v# registers being indexed must have a declaration(22.3.30) specifying each range that is to be indexed, where each range specifies a contiguous set of Elements/v# registers, ranges do not overlap, and the components declared for each Element in a given range are identical across the range.

    4.4.3.2 Stage-to-Stage Interface

    The second type of interface is Stage-to-Stage, where one Pipeline Stage outputs a set of 4 component Elements (written to output o# registers) to the subsequent active Pipeline Stage, which receives Elements in its input v# registers. The mapping of output registers in one Stage to input registers in the next Stage is always direct; so a value written to o3 always goes to v3 in the subsequent Stage. Any subset of the 4 components of any Element can be declared rather than the whole thing.

    If more Elements or components within Elements are output than are expected/declared for input by the subsequent Stage, the extra data gets discarded / becomes undefined. If fewer Elements or components within Elements are output than are expected/declared for input by the subsequent Stage, the missing data is undefined.

    Similar to the Memory-to-Stage interface at the "top" of the Pipeline, which feeds the input v# registers of the first active Pipeline Stage, at a Stage-to-Stage interface, writes to output Elements (o#) and at the subsequent Stage, reads from input elements (v#) can each be dynamically indexed as arrays from code at the respective Shaders. The Elements in o# registers being indexed must have a declaration(22.3.30) for each range, specifying a contiguous set of Elements/o# registers, without overlapping, and with the same component masks declared for each Element in a given range. The same applies to input v# registers at the subsequent stage (the array declarations for the input v# registers in the Shader are independent/orthogonal to the array declarations for o# in the previous Shader).

    4.4.3.2.1 Varying Frequencies of Operation

    There is a detail which is mostly orthogonal to the Stage-to-Stage interface discussion above: the frequency of operation varies at subsequent Stages, in addition to the different amounts of data different Stages can input. For example the Geometry Shader(13) inputs all the vertices for a primitive. The Pixel Shader(16) can choose to have its inputs interpolated from vertices, or take the data from a single vertex. The point of the above discussion is only to describe the mechanism for Element transport through the interfaces, independently of these varying frequencies of operation between Stages.

    4.4.3.3 Stage-to-Memory Interface

    The final type of interface is Stage-to-Memory, where a Pipeline Stage outputs a set of 4 component Elements (written to output o# registers) on a path out to memory. These interfaces (e.g. to RenderTargets or Stream Output) are somewhat the converse of the Memory-to-Stage Interface. Each memory Resource representing one or more Elements of output identifies each Element by a number #, corresponding directly to an output o# register. There is no linkage resolving done on behalf of the application; the application must associate target memory for Element output directly with each o# register that will provide it. Details on specifying these associations are unique for the different Stage-to-Memory interfaces (RenderTargets, Stream Output).

    If a Stage-to-Memory interface outputs more Elements or components within Elements than there are destination memory bindings to accommodate, the extra data is discarded. If a Stage-to-Memory interface outputs fewer Elements or components within Elements than there are destination memory bindings expecting to be written, undefined data will be output (i.e. no defaults). At RenderTarget output, there are various means to mask what data gets output, most interesting of which is depth testing, but that is outside the scope of this discussion.

    At the RenderTarget output interface (which is Pixel Shader(16) output), dynamic indexing of the o# registers is not supported. For the other Stage-to-Memory interface, Stream Output, indexing of outputs is permissible. Stream Output shares the output o# registers used for Stage-to-Stage output in the Geometry Shader(13) Stage, where indexing is permitted as defined for the Stage-to-Stage interface.

    4.4.4 System Generated Values

    There are various hardware generated values which can each be made available for input to certain Shader Stages by declaring them for input to a component of an input register. A listing of each System Generated Value in D3D11.3 can be found in the System Generated Value Reference(23), but in addition, here are links to descriptions of some (not all) of the System Generated Values: VertexID(8.16), InstanceID(8.18), PrimitiveID(8.17), IsFrontFace(15.12).

    In the Hull Shader(10), Domain Shader(12) and Geometry Shader(13), PrimitiveID is a special case that has its own input register, but for all other cases of inputting hardware generated values into Shaders (including the PrimitiveID into the Pixel Shader(16)), the Shader must declare a scalar component of one of its input v# registers as one of the System Generated Values to receive each input value. If that v# register also has some components provided by the previous Stage or Input Assembler(8), the hardware generated value can only be placed in one of the components after the rest of the data. For example if the Input Assembler provides v0.xz, then VertexID might be declared for v0.w (since w is after z), but not v0.y. There cannot be overlap between the target for generated values and the target for values arriving from an upstream Stage or the Input Assembler.

    Hardware generated values that are input into the generic v# registers can only be input into the first active Pipeline Stage in a given Pipeline configuration that understands the particular value; from that point on it is the responsibility of the Shader to manually pass the values down if desired through output o# registers. If multiple Stages in the pipeline request a hardware generated value, only the first stage receives it, and at the subsequent stages, the declaration is ignored (though a prudent Shader programmer would pass down the value manually to correspond with the naming).

    Since VertexID(8.16) and InstanceID(8.18) are both meaningful at a vertex level, and IDs generated by hardware can only be fed into the first stage that understands them, these ID values can only be fed into the Vertex Shader. PrimitiveID(8.17) generated by hardware can only be fed into the Hull Shader, Domain Shader, as well as whichever of the following is the first remaining active stage: Geometry Shader or Pixel Shader.

    It is not legal to declare a range of input registers as indexable(22.3.30) if any of the registers in the range contains a System Generated Value.

    From the API point of view, System Generated Values and System Interpreted Values (below) may be exposed to developers as just one concept: "System Values" ("SV_*").

    4.4.5 System Interpreted Values

    In many cases, hardware must be informed of the meaning of some of the application-provided or computed data moving through the D3D11.3 Pipeline, so the hardware may perform a fixed function operation using the data. The most obvious example is "position", which is interpreted by the Rasterizer (just before the Pixel Shader). Data flowing through the D3D11.3 Pipeline must be identified as a System Interpreted Value at the output interface between Stages where the hardware is expected to make use of the data. For the case where the Input Assembler(8) is the only Stage present in a Pipeline configuration before the place where the hardware is expected to interpret some data, the Input Assembler(8) has a mechanism for identifying System Interpreted Values to the relevant (components of) Elements it declares.

    A listing of each System Interpreted Value in D3D11.3 can be found in the System Interpreted Values Reference(24). Each System Interpreted Value has typically one place in the Pipeline where it is meaningful to the hardware. Also, there may be constraints on how many components in an Element need to be present (such as .xyzw for "position" going to the Rasterizer).

    If data produced by the Input Assembler or by the output o# registers of any Stage is identified as a System Interpreted Value at a point in the pipeline where the hardware has no use for interpreting the data, the label is silently ignored (and the data simply flows to the next active Stage uninterpreted). For example if the Input Assembler labels the xyzw components of one of the Elements it is producing as "position", but the first active Pipeline Stage is the Vertex Shader, the hardware ignores the label, since there is nothing for hardware to do with a "position" going into the Vertex Shader.

    Just because data is tagged as a System Interpreted Value, telling hardware what to do with it, does not mean the hardware necessarily "consumes" the data. Any data flowing through the Pipeline (System Interpreted Value or not) can typically be input into the next Pipeline Stage's Shader regardless of whether the hardware did something with the data in between. In other words, output data identified as a System Interpreted Value is available to the subsequent Shader Stage if it chooses to input the data, no differently from non-System Interpreted Values. If there are exceptions, they would be described in the System Interpreted Value Reference(24). One catch is that if a given Pipeline Stage, or the Input Assembler, identifies a System Interpreted Value (e.g. "clipDistance"), and the next Shader Stage declares it wants to input that value, it must not only declare as input the appropriate register # and component(s), but also identify the input as the same System Interpreted Value (e.g. "clipDistance"). Mismatching declarations results in undefined behavior. e.g. Identifying an output o3.x as "clipDistance", but not naming a declared input at the next stage v3.x as "clipDistance" is bad. Of course, in this example it would be legal for the subsequent Shader to not declare v3.x for input at all.

    It is not legal to declare a range of input or output registers as indexable(22.3.30) if any of the registers in the range contains a System Interpreted Value, with the exception of System Interpreted Values for the Tessellator, which have their own indexing rules - see the Hull Shader(10) specification.

    Note that there is no mechanism in the hardware to identify things that the hardware does not care about, such as "texture coordinate" or "color". At a high level in the software stack, full naming of all data may or may not be present to assist in authoring and/or discoverability, but these issues are outside the scope of anything that hardware or drivers need to know about.

    Note that while it may seem redundant to label System Interpreted Values at both the place producing the values as well as the next stage inputting it (in the case where the next stage actually wants to input it), this helps hardware/drivers isolate the compilation step for Shader programs at different Stages from any dependency on each other, in the event the driver needs to rename registers to fit hardware optimally, in a way that is transparent to the application.

    From the API point of view, System Generated Values and System Interpreted Values (above) may be exposed to developers as just one concept: "System Values" ("SV_*").

    4.4.6 Element Alignment

    In many cases in D3D11.3, an offset for an Element is required, a stride for a structure (e.g. vertex) is required, or an initial offset for a Buffer is required. All of these types of values have the following alignment restrictions:

    Example byte alignments for some of the formats(19.1) which can be used in structures (e.g. vertex buffers) or as elements in index buffers:

    However, these alignment rules do not apply to Buffer offsets when creating Views on Buffers. These Buffer offsets have more stringent requirements, detailed in the View section(5.2).

    There is also some similar discussion, focused on memory accesses common to UAVs(5.3.9), SRVs and Thread Group Shared Memory in the Memory Addressing and Alignment Issues(7.13) section.

    None of these rules are validated (except in debug mode) and violations will result in undefined behavior.


    5 Resources


    Chapter Contents

    (back to top)

    5.1 Memory Structure
    5.2 Resource Views
    5.3 Resource Types and Pipeline Bindings
    5.4 Resource Creation
    5.5 Resource Dimensions
    5.6 Resource Manipulation
    5.7 Resource Discard
    5.8 Per-Resource Mipmap Clamping
    5.9 Tiled Resources


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    Several different Resource Types (arrangements of memory storage) are available for input or output by various Pipeline stages. The available Resource Types are: Buffer(5.3.4) (typically a Structured(5.1.3) or Unstructured(5.1.2) region of memory), Texture1D(5.3.5) (homogeneous array of 1D Textures), Texture2D(5.3.6) (homogeneous array of 2D Textures), Texture3D(5.3.7) (volume Texture), and TextureCube(5.3.8) (3D enclosure). The Resource Type, in general, determines many characteristics, like whether the memory is Structured(5.1.3), where the Resource may be bound to in the graphics pipeline, how many mip levels there are, what the sampling behavior is, and other possible restrictions/properties on the Resource. Resources are built up of one or more Subresources, each of which is a generalized 3D quantity of data which degenerates to store 2D and 1D quantities of data. The arrangement of Subresources to build up a Resource is tied to the Resource Type and dimensions.

    There are also distinctions in how a Resource is bound to the graphics pipeline. The binding location can also be thought of as accepting either Buffers directly or accepting Views of Resources. Each binding location which accepts Views requires a unique View type for that location - e.g. Render Target View or Shader Resource View.

    The sizes of mip slice subresources 1..n are computed sequentially from the size of the largest subresource (subresource 0), where for each mipped dimension:

            mipslice N+1 size = max(1, floor( mipslice N size / 2 ))
    
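
    As an illustrative sketch of the sequence above (including the clamp to a minimum of 1 texel per dimension):

        // Size of one dimension of mip slice 'mipSlice', given the size of
        // that dimension in the largest subresource (mip slice 0).
        UINT MipSize(UINT topLevelSize, UINT mipSlice)
        {
            UINT size = topLevelSize;
            for (UINT i = 0; i < mipSlice; ++i)
                size = (size / 2 != 0) ? (size / 2) : 1; // floor(size/2), min 1
            return size;
        }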

    The following diagram depicts Resources, their Subresource arrangement, and how they are sampled from within shaders. While the following diagram depicts deep mip mapping, it is valid to create Resources with fewer than the maximum number of mip levels.



    5.1 Memory Structure


    Section Contents

    (back to chapter)

    5.1.1 Overview
    5.1.2 Unstructured Memory
    5.1.3 Structured Buffers
    5.1.4 Raw Buffers
    5.1.5 Prestructured+Typeless Memory
    5.1.6 Prestructured+Typed Memory


    5.1.1 Overview

    When a Resource is allocated, its memory structure can generally be classified either as Unstructured, Prestructured+Typeless, or Prestructured+Typed.

    5.1.2 Unstructured Memory

    Only the Buffer Resource(5.3.4) construction may be created as "Unstructured". Unstructured identifies the Resource as a single contiguous block of memory with no mipmaps, nor array slices. Unstructured Resources generally must have the memory structure defined when the Resource is bound to the graphics pipeline (providing types and offsets for the Element(s) in the Resource, as well as an overall stride). This memory structure can change freely, since it is late-bound to the Resource at the graphics pipeline binding location.

    The same Unstructured Resource may be bound to multiple slots in the graphics Pipeline with different memory interpretations at each location, as long as the Resource is only being read from at each binding. The same Unstructured Resource may not be bound to read and write stages of the pipeline simultaneously for a single Draw/Dispatch operation.

    Unstructured Resources do not have mipmaps nor array slices. See the Resource Binding Table(5.3.1) for descriptions of where Buffers (the only Resources that can be Unstructured) can be bound in the Pipeline.

    5.1.3 Structured Buffers

    Only the Buffer Resource(5.3.4) construction may be created as "Structured". Structured identifies the Resource as a single contiguous block of memory with no mipmaps, nor array slices, but it does have a structure size (stride), so that it represents an array of structures. Implementations can take advantage of knowing there is a fixed structure size in the way they lay out the memory physically (hidden from the application).

    A number of application scenarios require the ability to write a structure of data out to an index in an array. E.g. Generating an unordered collection of output data in an Append buffer(5.3.10). Hardware may be optimized for smaller reads and writes than the stride of a structure. Consider a group of 16 shader threads where each thread wants to write out the first 4 bytes of a structure. If the structure is only 4 bytes, the 16 threads will collectively write out 16 consecutive 32-bit locations, which tends to be fast. But if the structure is larger – say 64 bytes, then the 16 threads will each issue a write that is spaced 64 bytes apart. Then when reading the data back in a later pass, the same problem recurs. Reads will be issued with a spacing equal to the stride of the structure, with larger structures likely to have more of a performance issue.

    Since the reads and the writes have similar access patterns, it is better for the data layout in memory to match the access pattern that actually occurs. Because the actual access pattern, as well as the performance characteristics of reads spaced on stride boundaries, is hardware specific, the design pattern of textures is followed: the physical layout of the memory is hidden from the application, allowing implementations to lay it out for better performance.

    The same Structured Resource may be bound to multiple slots in the graphics Pipeline, as long as the Resource is only being read from at each binding. The same Structured Resource may not be bound to read and write stages of the pipeline simultaneously for a single Draw/Dispatch operation.

    Structured Resources do not have mipmaps nor array slices. See the Resource Binding Table(5.3.1) for descriptions of where Buffers (the only Resources that can be Structured) can be bound in the Pipeline.

    5.1.4 Raw Buffers

    Sometimes a convenient way to access the contents of a Buffer is to treat it simply as a huge bag of bits. The Raw view comes close to this, allowing 32-bit aligned addressing of a Buffer and access to data in chunks of 1 to 4 32-bit values, with no type.

    Raw access to a Buffer is indicated when creating either a Shader Resource View(5.2) (SRV) or Unordered Access View(5.3.9) (UAV), with the flag D3D11_BUFFER_SRV_FLAG_RAW (SRV) or D3D11_BUFFER_UAV_FLAG_RAW (UAV).

    To be able to create a Raw View, the underlying resource must have been created with D3D11_RESOURCE_MISC_ALLOW_RAW_VIEWS.

    This flag cannot be combined with D3D11_RESOURCE_MISC_STRUCTURED_BUFFER. Also, a Buffer created with D3D11_BIND_CONSTANT_BUFFER cannot also specify D3D11_RESOURCE_MISC_ALLOW_RAW_VIEWS. This is not a limitation, since Constant Buffers already have a constraint that they cannot be accessed with any other View in the first place.

    Other than those invalid cases, specifying D3D11_RESOURCE_MISC_ALLOW_RAW_VIEWS when creating a Buffer does not limit any functionality versus not having it – e.g. the Buffer can be used for non-RAW access in any number of ways possible with D3D. Specifying the D3D11_RESOURCE_MISC_ALLOW_RAW_VIEWS flag only increases available functionality – it is just giving the system an early indication that the Buffer may participate in RAW style access in addition to other uses.
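
    A minimal API-level sketch of Raw Buffer creation and a Raw SRV over its full extent follows; pDevice and the sizes are assumptions, and note that at the API these flags are spelled D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS and D3D11_BUFFEREX_SRV_FLAG_RAW.

        D3D11_BUFFER_DESC bufDesc = {};
        bufDesc.ByteWidth = 65536;
        bufDesc.Usage = D3D11_USAGE_DEFAULT;
        bufDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
        bufDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;
        ID3D11Buffer* pBuffer = nullptr;
        pDevice->CreateBuffer(&bufDesc, nullptr, &pBuffer);

        D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
        srvDesc.Format = DXGI_FORMAT_R32_TYPELESS;        // Raw views are typeless 32-bit
        srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFEREX;
        srvDesc.BufferEx.FirstElement = 0;
        srvDesc.BufferEx.NumElements = 65536 / 4;         // counted in 32-bit elements
        srvDesc.BufferEx.Flags = D3D11_BUFFEREX_SRV_FLAG_RAW;
        ID3D11ShaderResourceView* pSRV = nullptr;
        pDevice->CreateShaderResourceView(pBuffer, &srvDesc, &pSRV);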

    5.1.5 Prestructured+Typeless Memory

    Any Resource type may be created as "Prestructured+Typeless". A structure size is provided, plus bit widths of components (but not the types of those components), and also dimensions (in units of structures) appropriate for the Resource type. This is unlike a Structured Buffer, which only specifies a structure size/stride and no definition of the contents of the structure. Before the Resource is bound to the pipeline, Resource Views must be created which will fully qualify the components' types. These Resource Views also allow the Resource to be decomposed into smaller compatible subgroupings of the Subresources. For example, a fully mipped DXGI_FORMAT_R32G32B32A32_TYPELESS Texture3D with a width of four, a height of three, and a depth of five, would have three mip levels. To use this texture, a Resource View would have to fully qualify the format of the Resource, possibly to DXGI_FORMAT_R32G32B32A32_UINT. In addition, the Resource View could also regroup only the two least detailed mip levels or select only a particular mip level. This allows the original Resource to be manipulated as if it were a Resource made up of only a few Subresources within the original Resource. The full details of Resource Views(5.2) are described later.

    The benefit of Prestructured+Typeless Resources is that memory may be used as weakly typed storage, enabling limited reuse or reinterpretation of the memory, as long as the component bit counts remain the same. The same Prestructured+Typeless Resource may be bound to multiple slots in the graphics pipeline with Views of different fully qualified formats at each location. This forces bit representations of formats to be well-defined with respect to each other.

    For example, a Resource created with the format R32G32B32A32_TYPELESS may be used as R32G32B32A32_FLOAT and R32G32B32A32_UINT at different locations in the pipeline simultaneously.
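
    A minimal API-level sketch of that kind of reuse, using a Texture2D for brevity (pDevice, the dimensions and the bind flags are assumptions):

        D3D11_TEXTURE2D_DESC texDesc = {};
        texDesc.Width = 256; texDesc.Height = 256;
        texDesc.MipLevels = 1; texDesc.ArraySize = 1;
        texDesc.Format = DXGI_FORMAT_R32G32B32A32_TYPELESS; // weakly typed storage
        texDesc.SampleDesc.Count = 1;
        texDesc.Usage = D3D11_USAGE_DEFAULT;
        texDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
        ID3D11Texture2D* pTex = nullptr;
        pDevice->CreateTexture2D(&texDesc, nullptr, &pTex);

        D3D11_SHADER_RESOURCE_VIEW_DESC srvFloat = {};
        srvFloat.Format = DXGI_FORMAT_R32G32B32A32_FLOAT;   // one fully qualified View...
        srvFloat.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
        srvFloat.Texture2D.MipLevels = 1;
        D3D11_SHADER_RESOURCE_VIEW_DESC srvUint = srvFloat;
        srvUint.Format = DXGI_FORMAT_R32G32B32A32_UINT;     // ...and another over the same bits
        ID3D11ShaderResourceView *pViewFloat = nullptr, *pViewUint = nullptr;
        pDevice->CreateShaderResourceView(pTex, &srvFloat, &pViewFloat);
        pDevice->CreateShaderResourceView(pTex, &srvUint, &pViewUint);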

    5.1.6 Prestructured+Typed Memory

    Any Resource type may be created as "Prestructured+Typed", also known as creating the Resource with a fully-qualified type or format. In general, this may allow Resource optimizations, especially when the Resource is created with flags indicating that the Resource cannot be Mapped/ Locked by the application.

    Special resource formats, such as Block Compression Formats(19.5), have the characteristic that in order to read an individual Element in the resource, there is not a unique location in the resource that corresponds to the Element. Some sort of decompression or decoding of data from locations in the resource that are not unique to a particular Element is required during the read process in order to resolve what an individual Element is (even when no filtering is being applied). Complex formats like this must be created as part of a "Prestructured+Typed" resource.

    "Prestructured+Typed" and "Prestructured+Typeless" resources support mipmapping, as the combination of Resource type, dimensions and structure size provided during resource creation supply enough information to allocate all memory in the layout required. Additionally, Resource Views created against Prestructured+Typed Resources must have indentical Resource Formats as the Prestructured+Typed Resource.


    5.2 Resource Views


    Section Contents

    (back to chapter)

    5.2.1 Overview
    5.2.2 Shader Resource View Support for Raw and Structured Buffers
    5.2.3 Clearing Views

    5.2.3.1 Clearing RenderTarget and DepthStencil Views
    5.2.3.2 Clearing Unordered Access Views
    5.2.3.3 Alternative: ClearView
    5.2.3.3.1 ClearView Rect mapping to surface area


    5.2.1 Overview

    In order to indirectly bind a Resource to certain stages of the graphics pipeline, Resource Views must be used. In addition, since some Resources may be created as "Prestructured+Typeless", the View provides the final opportunity to fully qualify the Resource's component types. The Resource Views also allow the Resource to be decomposed into smaller compatible subgroupings of the Mip Slices, Array Slices, and Subresources. This means that the effective dimensions and array sizes of the Views will, naturally, always be less than or equal to those of the original Resource. Each stage of the pipeline requires a unique type of View, and each type of View may have its own custom set of state parameters that are needed to complete the process of binding a particular Resource to the graphics pipeline stage. All necessary restrictions to the basic Resource have already been applied through the Pipeline Bind Flags during Resource creation. These Resource Views are directly bound to the pipeline, instead of the Resource objects themselves.

    A resource view is distinct from the underlying resource from which the view was created, so where views are used, the view properties (number of mipmaps, number of array elements, type, etc.) are always used in place of the properties of the original resource. Thus, for example, a render target array index of zero always indicates the first array element in the view, even if the first array element in the view is not the first array element in the underlying resource. Out of range behaviors are also always with respect to the view properties where views are used.

    Each unique View type has certain restrictions associated with the bind location of the graphics pipeline stage. For example, Render Target Views of Buffers may have a maximum width of 16384. This maximum is smaller than the maximum size of a Buffer (min(max(128,0.25f * (Amount of Dedicated VRAM)),2048) MB), so only a subsection of large Buffers may be bound as a Render Target at a time. In addition, Render Target Views of Texture3D may have a maximum array size of 2048. This fortunately matches the maximum W dimension size of a Texture3D (2048).

    When Views are created of Buffers, restrictions are placed on the View's starting offset in the Buffer. If represented as a byte offset, the offset must be a multiple of the View Element Size. Another way to comply with this restriction is by specifying the Buffer offset in an integral number of View Elements. In addition, there exists another restriction on Buffer View creation. Views of the R32G32B32 element type cannot be created on a Buffer which had the Pipeline Bind flag of IAVERTEXINPUT, IAINDEXINPUT, CONSTANTBUFFER, or STREAMOUTPUT set. This prevents an R32G32B32 element from being used simultaneously as vertex and texture data.

    To characterize the kind of decomposition that Shader Resource Views are capable of, here's a complete listing of the Views that are possible with a Texture2D Resource that was created fully mipped with the most detailed LOD: width = 4, height = 4, arraysize = 3.

    1. MostDetailedMip: 0 (w:4,h:4), MipLevels: 3, FirstArraySlice: 0, ArraySize: 3
    2. MostDetailedMip: 0 (w:4,h:4), MipLevels: 3, FirstArraySlice: 0, ArraySize: 2
    3. MostDetailedMip: 0 (w:4,h:4), MipLevels: 3, FirstArraySlice: 1, ArraySize: 2
    4. MostDetailedMip: 0 (w:4,h:4), MipLevels: 3, FirstArraySlice: 0, ArraySize: 1
    5. MostDetailedMip: 0 (w:4,h:4), MipLevels: 3, FirstArraySlice: 1, ArraySize: 1
    6. MostDetailedMip: 0 (w:4,h:4), MipLevels: 3, FirstArraySlice: 2, ArraySize: 1
    7. MostDetailedMip: 0 (w:4,h:4), MipLevels: 2, FirstArraySlice: 0, ArraySize: 3
    8. MostDetailedMip: 0 (w:4,h:4), MipLevels: 2, FirstArraySlice: 0, ArraySize: 2
    9. MostDetailedMip: 0 (w:4,h:4), MipLevels: 2, FirstArraySlice: 1, ArraySize: 2
    10. MostDetailedMip: 0 (w:4,h:4), MipLevels: 2, FirstArraySlice: 0, ArraySize: 1
    11. MostDetailedMip: 0 (w:4,h:4), MipLevels: 2, FirstArraySlice: 1, ArraySize: 1
    12. MostDetailedMip: 0 (w:4,h:4), MipLevels: 2, FirstArraySlice: 2, ArraySize: 1
    13. MostDetailedMip: 1 (w:2,h:2), MipLevels: 2, FirstArraySlice: 0, ArraySize: 3
    14. MostDetailedMip: 1 (w:2,h:2), MipLevels: 2, FirstArraySlice: 0, ArraySize: 2
    15. MostDetailedMip: 1 (w:2,h:2), MipLevels: 2, FirstArraySlice: 1, ArraySize: 2
    16. MostDetailedMip: 1 (w:2,h:2), MipLevels: 2, FirstArraySlice: 0, ArraySize: 1
    17. MostDetailedMip: 1 (w:2,h:2), MipLevels: 2, FirstArraySlice: 1, ArraySize: 1
    18. MostDetailedMip: 1 (w:2,h:2), MipLevels: 2, FirstArraySlice: 2, ArraySize: 1
    19. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 0, ArraySize: 3
    20. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 0, ArraySize: 2
    21. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 1, ArraySize: 2
    22. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 0, ArraySize: 1
    23. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 1, ArraySize: 1
    24. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 2, ArraySize: 1
    25. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 0, ArraySize: 3
    26. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 0, ArraySize: 2
    27. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 1, ArraySize: 2
    28. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 0, ArraySize: 1
    29. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 1, ArraySize: 1
    30. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 2, ArraySize: 1
    31. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 0, ArraySize: 3
    32. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 0, ArraySize: 2
    33. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 1, ArraySize: 2
    34. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 0, ArraySize: 1
    35. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 1, ArraySize: 1
    36. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 2, ArraySize: 1

    The Views bound at the Render Target, Depth Stencil and Unordered Access binding locations in the pipeline have further restrictions, in that they can only choose a Mip Slice, i.e. select only one mip level. Here's a listing of the possible decompositions that can occur with Render Target, Depth Stencil and Unordered Access Views of the same Resource used in the previous example:

    1. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 0, ArraySize: 3
    2. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 0, ArraySize: 2
    3. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 1, ArraySize: 2
    4. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 0, ArraySize: 1
    5. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 1, ArraySize: 1
    6. MostDetailedMip: 0 (w:4,h:4), MipLevels: 1, FirstArraySlice: 2, ArraySize: 1
    7. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 0, ArraySize: 3
    8. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 0, ArraySize: 2
    9. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 1, ArraySize: 2
    10. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 0, ArraySize: 1
    11. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 1, ArraySize: 1
    12. MostDetailedMip: 1 (w:2,h:2), MipLevels: 1, FirstArraySlice: 2, ArraySize: 1
    13. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 0, ArraySize: 3
    14. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 0, ArraySize: 2
    15. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 1, ArraySize: 2
    16. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 0, ArraySize: 1
    17. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 1, ArraySize: 1
    18. MostDetailedMip: 2 (w:1,h:1), MipLevels: 1, FirstArraySlice: 2, ArraySize: 1

    5.2.2 Shader Resource View Support for Raw and Structured Buffers

    The following DDIs indicate the way Shader Resource Views (SRVs) are created, allowing read-only access to Raw and Structured Buffers in any shader stage.

    Making an SRV of a Raw buffer allows it to be declared for read in any shader stage by the ld_raw instruction. This is accomplished by specifying a flag on creation of the Buffer View requesting Raw access (D3D11_DDI_BUFFEREX_SRV_FLAG_RAW) shown below.

    In contrast, if the underlying Buffer was created as a Structured Buffer, then any SRV of the Buffer inherits the Structured semantics. In this case all shader stages can declare the resource for read by the ld_structured instruction. Note that unlike _RAW views (where the View decides that the Buffer will be "viewed" as RAW), nothing about the creation of a View of a Structured Buffer needs to indicate that it is structured, because once the Structured property is assigned to a Buffer on creation of the resource (including a structure stride), all Views on the Buffer are automatically Structured.

    typedef struct D3D11DDIARG_BUFFER_SHADERRESOURCEVIEW
    {
        UINT     FirstElement;
        UINT     NumElements;
    } D3D11DDIARG_BUFFER_SHADERRESOURCEVIEW;
    
    
    // BufferEx - Ex means extra parameters
    typedef struct D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW
    {
        UINT     FirstElement;
        UINT     NumElements;
        UINT     Flags; // See D3D11_DDI_BUFFEREX_SRV_FLAG* below
    } D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW;
    #define D3D11_DDI_BUFFEREX_SRV_FLAG_RAW         0x00000001
    
    typedef struct D3D11DDIARG_CREATESHADERRESOURCEVIEW
    {
        D3D11DDI_HRESOURCE    hDrvResource;
        DXGI_FORMAT           Format;
        D3D11DDIRESOURCE_TYPE ResourceDimension;
    
        union
        {
            D3D11DDIARG_BUFFER_SHADERRESOURCEVIEW    Buffer;
            D3D11DDIARG_TEX1D_SHADERRESOURCEVIEW     Tex1D;
            D3D11DDIARG_TEX2D_SHADERRESOURCEVIEW     Tex2D;
            D3D11DDIARG_TEX3D_SHADERRESOURCEVIEW     Tex3D;
            D3D11DDIARG_TEXCUBE_SHADERRESOURCEVIEW   TexCube;
            D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW  BufferEx;
        };
    } D3D11DDIARG_CREATESHADERRESOURCEVIEW;
    
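
    For reference, a minimal sketch of the corresponding API-level creation of a Structured Buffer and an SRV over it (pDevice, the element count and the 16 byte stride are assumptions; the Format must be DXGI_FORMAT_UNKNOWN since the stride is a property of the resource):

        D3D11_BUFFER_DESC bufDesc = {};
        bufDesc.ByteWidth = 1024 * 16;
        bufDesc.Usage = D3D11_USAGE_DEFAULT;
        bufDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
        bufDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
        bufDesc.StructureByteStride = 16;
        ID3D11Buffer* pBuffer = nullptr;
        pDevice->CreateBuffer(&bufDesc, nullptr, &pBuffer);

        D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
        srvDesc.Format = DXGI_FORMAT_UNKNOWN;   // Structured: stride comes from the resource
        srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
        srvDesc.Buffer.FirstElement = 0;
        srvDesc.Buffer.NumElements = 1024;      // counted in structures
        ID3D11ShaderResourceView* pSRV = nullptr;
        pDevice->CreateShaderResourceView(pBuffer, &srvDesc, &pSRV);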

    5.2.3 Clearing Views

    Clearing is an optimized operation to enable filling Render Target, Depth Stencil and Unordered Access Views with certain clear values.

    5.2.3.1 Clearing RenderTarget and DepthStencil Views

    The floating point values passed in through the DDI must be converted to the fully qualified format type of the View being cleared. The standard type conversion rules(3.2) indicate how to convert most values, but these conversion rules do not explicitly handle the case where the destination fixed point format contains more integer bits than the floating point format mantissa. When converting these floating point values to a format such as DXGI_FORMAT_R32G32B32A32_UINT or _SINT, the closest value is chosen. When the original floating point absolute value is larger than 2^24, the least significant bits of the destination are to be filled with 0's for _UINT and positive _SINT values, or 1's for negative _SINT values.

    The full extent of the resource view is always cleared. Viewport and scissor are not applied.

    Depth clear values outside of the range specified in viewport range(15.6.1) will not be passed to the DDI.

        // part of user mode Device interface:
        STDMETHOD_( void, ClearRenderTarget )(
            D3D10DDI_HDEVICE hDevice,
            D3D11DDI_HRENDERTARGETVIEW hRenderTargetView,
            FLOAT ColorRGBA[ 4 ] );
        STDMETHOD_( void, ClearDepthStencil )(
            D3D10DDI_HDEVICE hDevice,
            D3D11DDI_HDEPTHSTENCILVIEW hDepthStencilView,
            UINT DSFlags, FLOAT Depth, UINT8 Stencil );
    
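
    The corresponding API-level calls, as a minimal sketch (pContext and the view pointers are assumptions):

        const FLOAT clearColor[4] = { 0.0f, 0.0f, 0.0f, 1.0f };
        pContext->ClearRenderTargetView(pRTV, clearColor);
        pContext->ClearDepthStencilView(pDSV,
            D3D11_CLEAR_DEPTH | D3D11_CLEAR_STENCIL, 1.0f, 0);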

    5.2.3.2 Clearing Unordered Access Views

    For UnorderedAccessViews(5.3.9), there are a couple of ways to Clear the View.

    ClearUnorderedAccessViewUint(...) clears a UAV with bit-precise values, copying the lower n_i bits from each array element i to the corresponding channel, where n_i is the number of bits in the i'th channel of the resource Format (for example, for a format with 8 bits in each of its first 3 channels, the lower 8 bits of the first 3 array elements are copied). This works on any UAV with no format conversion. For Raw Buffer and Structured Buffer Views, only the first array element's value is used.

    ClearUnorderedAccessViewFloat(...) clears a UAV with a float value. It only works on FLOAT, UNORM, and SNORM UAVs, with format conversion from FLOAT to *NORM where appropriate. On other UAVs, the operation is invalid and the call will not reach the driver.

        // part of user mode Device interface:
        STDMETHOD_( void, ClearUnorderedAccessViewUint)(
            D3D10DDI_HDEVICE hDevice,
            D3D11DDI_HUNORDEREDACCESSVIEW hUnorderedAccessView,
            UINT Values[ 4 ] );
        STDMETHOD_( void, ClearUnorderedAccessViewFloat)(
            D3D10DDI_HDEVICE hDevice,
            D3D11DDI_HUNORDEREDACCESSVIEW hUnorderedAccessView,
            FLOAT Values[ 4 ] );
    
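
    The corresponding API-level calls, as a minimal sketch (pContext and the UAV pointers are assumptions):

        const UINT uintValues[4] = { 0, 0, 0, 0 };
        pContext->ClearUnorderedAccessViewUint(pAnyUAV, uintValues);    // bit-precise, any UAV
        const FLOAT floatValues[4] = { 0.5f, 0.5f, 0.5f, 1.0f };
        pContext->ClearUnorderedAccessViewFloat(pNormUAV, floatValues); // FLOAT/UNORM/SNORM UAVs only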

    5.2.3.3 Alternative: ClearView

    View clearing command, implemented in whatever way the driver deems most efficient. The primary distinction here versus the other Clears described above in D3D11 is that this takes a list of rects (an empty list clears the entire surface). This method only works on an RTV, UAV, or any Video View of a Texture2D surface (the runtime drops invalid calls). All array slices in the view get the same clear applied (any rects apply to each array slice).

    The driver or hardware is responsible for clamping rects to the surface extents.

    Color values are converted/clamped to the destination format as appropriate per D3D conversion rules. E.g. if the format of the view is R8G8B8A8_UNORM, inputs are clamped to 0.0f to 1.0f (NaN to 0).

    If the format is integer, such as R8G8B8A8_UINT, inputs are taken as integral floats, so 235.0f maps to 235 (fractions rounded to zero, out of range/INF values clamped to target range, NaN to 0).

    typedef VOID ( APIENTRY* PFND3D11_1DDI_CLEARVIEW )(
        D3D10DDI_HDEVICE hDevice,
        D3D11DDI_HANDLETYPE viewType, // View type that supports this clear
                                      // (RTV, UAV or any Video view).
                                      // Must be a Texture2D{Array} resource only
        VOID* hView,
        const FLOAT Color[4], // interpretation of color is view / format specific
        const D3D10_DDI_RECT* pRect, // Rect is subject to alignment constraints based on format being cleared.
                                    // e.g. Subsampled video formats require rect extents snapped to full sample boundary
                                    // NULL means clear the entire view.
        UINT numRects
         );
    
    Color Mappings for RTVs and UAVs:
    Color[0]: R
    Color[1]: G
    Color[2]: B
    Color[3]: A
    (e.g. An RTV of the Y plane of an NV12 surface, of format R8_*, would take the color from R.  An RTV of the UV plane of an NV12 surface, of format R8G8_*, would take the color from RG.)
    
    Color Mappings for Video Views:
    Color[0]: Y
    Color[1]: U/Cb
    Color[2]: V/Cr
    Color[3]: A
    

    For Video Views with YUV or YCbCr formats, no color space conversion happens – and in cases where the format name doesn't indicate _UNORM vs. _UINT etc., _UINT is assumed (so input 235.0f maps to 235 as described above).

    This feature is required to be supported for all D3D10+ hardware in D3D11.1 drivers, and for D3D9 drivers it maps to the already existing functionality there. The D3D9 equivalent honored the scissor rect, so emulation of ClearView on the D3D9 DDI will unset scissor / clear / reset scissor to achieve the intended behavior of ClearView (this scissor manipulation isn't needed on the new D3D11.1 ClearView DDI, which ignores scissor/viewports by definition).

    Having this Clear with rects provides parity with D3D9 where there was a similar Clear that in particular was used for video. With Video added to D3D11 (outside the scope of this spec), adding this ClearView provides parity with D3D9.

    Direct2D will be another user of this for rendering scenarios that map to a fill.
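
    At the API, ClearView is exposed on ID3D11DeviceContext1. A minimal sketch (pContext1, the view pointer and the rect values are assumptions):

        const FLOAT color[4] = { 0.0f, 0.0f, 0.0f, 1.0f };
        D3D11_RECT rect = { 16, 16, 48, 48 };          // left, top, right, bottom
        pContext1->ClearView(pRTV, color, &rect, 1);   // clear just the rect
        pContext1->ClearView(pRTV, color, nullptr, 0); // no rects: clear the entire view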

    5.2.3.3.1 ClearView Rect mapping to surface area

    For RTVs and UAVs: The space the ClearView rects apply to is that of the view format (as opposed to the surface format, which for video surfaces can be different sizes). This is consistent with how Viewports and rendering work on those views. e.g. for a 64x64 YUYV surface, an RTV with the format R8G8B8A8_UINT appears in shaders (and to RSSetViewports()) as having dimensions 32x64 RGBA values. ClearView's rects apply to the same space. The "color" coming into ClearView just maps to the channels in the view (RGBA), ignoring the video layout. So a single clear color could really mean "stripes" of color if interpreted in video space. That's not interesting to do, but it just falls out and isn't worth bothering to validate out – the user who makes D3D views of video surfaces has to know they are operating on the raw memory via D3D – be it through shaders or APIs like ClearView.

    By contrast, ClearView on Video Views (the views that are used with the video pipeline and not D3D Rasterization) operates on logical surface dimensions. So a 64x64 YUYV surface appears as though it is that size, and so rects passed into ClearView are in that full 64x64 space (not 32x64). It is undefined to request clearing non-aligned rects (covering only half of the pixel pairs). The color passed into ClearView is just a single YUV value that is appropriately replicated for subsampled pixels by the driver. Video Views hide the memory layout from the API user, so they do not have to worry about what type of subsampling is going on (an exception is the alignment of the rect bounds).


    5.3 Resource Types and Pipeline Bindings


    Section Contents

    (back to chapter)

    5.3.1 Overview
    5.3.2 Performant Readback
    5.3.3 Conversion Resource Copies/ Blts
    5.3.4 Buffer

    5.3.4.1 Buffer: Pipeline Binding: Input Assembler Vertex Data
    5.3.4.2 Buffer Pipeline Binding: Input Assembler Index Data
    5.3.4.3 Buffer Pipeline Binding: Shader Constant Input
    5.3.4.3.1 Partial Constant Buffer Updates
    5.3.4.3.2 Offsetting Constant Buffer Bindings
    5.3.4.4 Buffer Pipeline Binding: Shader Resource Input
    5.3.4.5 Pipeline Binding: Stream Output
    5.3.4.6 Pipeline Binding: RenderTarget Output
    5.3.4.7 Pipeline Binding: Unordered Access
    5.3.5 Texture1D
    5.3.5.1 Pipeline Binding: Shader Resource Input
    5.3.5.2 Pipeline Binding: RenderTarget Output
    5.3.5.3 Pipeline Binding: Depth/ Stencil Output
    5.3.6 Texture2D
    5.3.6.1 Pipeline Binding: Shader Resource Input
    5.3.6.2 Pipeline Binding: RenderTarget Output
    5.3.6.3 Pipeline Binding: Depth/ Stencil Output
    5.3.7 Texture3D
    5.3.7.1 Pipeline Binding: Shader Resource Input
    5.3.7.2 Pipeline Binding: RenderTarget Output
    5.3.8 TextureCube
    5.3.8.1 Pipeline Binding: Shader Resource Input
    5.3.8.2 Pipeline Binding: RenderTarget Output
    5.3.8.3 Pipeline Binding: Depth/ Stencil Output
    5.3.9 Unordered Access Views
    5.3.9.1 Creating the Underlying Resource for a UAV
    5.3.9.2 Creating an Unordered Access View (UAV) at the DDI
    5.3.9.3 Binding an Unordered Access View at the DDI
    5.3.9.4 Hazard Tracking
    5.3.9.5 Limitations on Typed UAVs
    5.3.10 Unordered Count and Append Buffers
    5.3.10.1 Creating Unordered Count and Append Buffers
    5.3.10.2 Using Unordered Count and Append Buffers
    5.3.11 Video Views


    5.3.1 Overview

    All Resources must be qualified with a set of Pipeline Bind flags at creation time to indicate where in the graphics pipeline the Resource may be bound. Binding a Resource at a certain pipeline location imposes certain restrictions on the Resource for its entire lifetime. Naturally, Resources may be bound at more than one location in the pipeline (even simultaneously, within certain restrictions), but the Resource must satisfy all the restrictions that each Pipeline Bind flag imposes. Certain pipeline locations only accept Resource Views(5.2) to be bound to them. In such a case, the presence of the Pipeline Bind flag indicates that Resource Views can be created against the Resource in order to bind the Resource to such a pipeline location. Sometimes Pipeline Bind flags impose restrictions which conflict with each other, so such Pipeline Bind flags are naturally mutually exclusive. Otherwise, explicit mention is given when one Pipeline Bind flag prevents the usage of other Pipeline Bind flags.

    The following table indicates which Resource Types may be bound to which available graphics Pipeline locations. A single Resource may not be bound in its entirety to both an input and an output Pipeline stage during a Draw operation. However, it is possible to refer to discrete components of the Resource, with Resource Views(5.2), allowing the same Resource to be bound as an input and output simultaneously, as long as the different Views do not share the same Subresources. For example: a two-dimensional mipped Resource created with the appropriate Pipeline Bind flags may have Subresources bound as Shader Resource Inputs, and a mutually exclusive Subresource from the same Resource bound as a RenderTarget Output, by using different Views.

    Resource Type   IA Vertex/Index   Shader Resource Input   Shader Constant Input   Stream Output   RenderTarget Output   Depth/Stencil Output
    Buffer          U                 V                       U                       U               V                     -
    Texture1D       -                 V                       -                       -               V                     V
    Texture2D       -                 V                       -                       -               V                     V
    Texture3D       -                 V                       -                       -               V                     -
    TextureCube     -                 V                       -                       -               V                     V

    (U = bindable directly; V = bindable via a Resource View(5.2); "-" = not available)

    5.3.2 Performant Readback

    Any Resource that is used as an output for the graphics pipeline cannot be mapped/ locked. This is not meant to block an application from viewing the contents of such a Resource. It is expected that to read the contents of such Resources in a performant manner, the contents must be copied to a Resource which is able to be mapped/ locked for CPU read access. Typically, the Resource which is able to be mapped/ locked will not be marked with any Pipeline Bind flags, and as such is expected to be a driver allocated system memory Resource which is allocated in such a fashion to be compatible with the hardware DMA engine. The Resource is also expected to be allocated for performant CPU reads. This enables an asynchronous performant read back for the CPU.

    5.3.3 Conversion Resource Copies/ Blts

    The Performant Readback(5.3.2) scenario highlights the need that, for any device-dependent memory arrangement used to optimize GPU Resources which cannot be mapped/ locked, there is always a performant ability to convert the memory arrangement into the device-independent memory arrangement that will be used to satisfy the map/ lock. This principle also relates to input Resources that cannot be mapped/ locked: non-mappable/ non-lockable input Resources may use a device-dependent memory arrangement and still be updated with UpdateSubresourceUP(5.6.8), CopyResource(5.6.3), and CopySubresourceRegion(5.6.2). Therefore, there is also a need for a performant ability to convert the device-independent memory arrangement into any device-dependent memory arrangement.

    5.3.4 Buffer

    The Buffer is the only Resource which can be created as Unstructured(5.1.2). When the Buffer is bound to the graphics Pipeline, its memory interpretation generally must also be bound to the graphics Pipeline along with it (providing types and offsets for the Element(s) in the Resource, as well as an overall stride). Sometimes this information is bound or described separately.

    A Buffer has neither multiple mip levels nor multiple array slices, so a Buffer is made up of only a single Subresource. Buffers can be bound at multiple places in the pipeline simultaneously during a Draw call as long as the Buffer is only read from at each location. If the Buffer is being written to, then the Buffer may only be bound to one location in the pipeline during a Draw call.

    5.3.4.1 Buffer: Pipeline Binding: Input Assembler Vertex Data

    When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as an Input Assembler Vertex Input, the Buffer may contain multiple types of data per vertex. This data type, offset, and stride binding is done when the Resource is bound to the Pipeline.

    5.3.4.2 Buffer Pipeline Binding: Input Assembler Index Data

    When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as an Input Assembler Index Input, and the Buffer is bound as an Index Input, at the time of binding the format must be specified as either R16_UINT or R32_UINT.

    5.3.4.3 Buffer Pipeline Binding: Shader Constant Input

    When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as a Shader Constant Input, the format of the Buffer is assumed to be R32G32B32A32_TYPELESS when bound as a Shader Constant Input. The Buffer size viewable from a shader is restricted to hold a maximum of 4096 elements. The overall buffer size can be larger - see Offsetting Constant Buffer Bindings(5.3.4.3.2). The usage of Constant Buffers within the shaders is expected to make Shader execution more efficient than using ld(22.4.6) or sample(22.4.15) with a Shader Resource within the Shader. Constant Input is read into a Shader given an integer array index to fetch a single Element. This is similar to point sampling of a texture, as there is no filtering. Constant Input is only needed to store Shader constants which could change between Draw() calls, as opposed to Immediate Constants or an Immediate Constant Buffer, which are embedded into a Shader.

    A Shader Constant Resource is expected to be optimized for moving constant data from the CPU to the graphics adapter, and as such, may not be able to be mapped/ locked, allowing the CPU to read the contents of the Buffer directly. Therefore, the Resource may only be CPUWRITE (write-only) or not mappable/ lockable. In addition, if the Resource is mappable/ lockable, Map/ Lock must be called with DISCARDRESOURCE. NOOVERWRITE is not valid on Shader Constant Resources either. The Resource may still be used with CopyResource(5.6.3) and CopySubresourceRegion(5.6.2). All other Pipeline Bind flags are prevented from being used, disallowing constant buffers to be vertex buffers, streamed out to or rendered to, etc.

    5.3.4.3.1 Partial Constant Buffer Updates

    Map() allows NO_OVERWRITE for Constant Buffers. This was disallowed before D3D11.1.

    Similarly, UpdateSubresource1() adds the ability to perform partial Constant Buffer updates, so the pDstBox parameter does not have to be NULL when updating Constant Buffers via UpdateSubresource1(). Either the NO_OVERWRITE or DISCARD flag must be specified for a partial update, and the extents of the pDstBox parameter must be aligned to 16-byte (full constant) boundaries or the call is dropped.

    Before the first call with NO_OVERWRITE on a deferred context, a DISCARD must be done on the same context (via Copy*()/Update*()/Map() API flag or Discard*() API). This is not required on immediate contexts if the application knows the GPU is finished with the resource (though discard can be used if not).
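
    A minimal sketch of a partial Constant Buffer update via UpdateSubresource1() (pCtx1 and pCB are assumed to already exist; the constant values are placeholders):

    // Update constants [4..8) of a Constant Buffer in place. The box extents are
    // 16-byte aligned (full constants) and NO_OVERWRITE is specified, per the rules
    // above; a misaligned box or a missing flag would cause the call to be dropped.
    D3D11_BOX box = {};
    box.left  = 4 * 16;    // byte offset of the first constant to update
    box.right = 8 * 16;    // byte offset one past the last constant to update
    box.top = 0; box.bottom = 1;
    box.front = 0; box.back = 1;
    FLOAT data[4][4] = {}; // four constants' worth of replacement data
    pCtx1->UpdateSubresource1(pCB, 0, &box, data, 0, 0, D3D11_COPY_NO_OVERWRITE);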

    This feature is required to be supported for all D3D10+ hardware with D3D11.1 drivers.

    This allows applications to partially go back to a DX9-style convention where they have the ability to set individual constants in a Constant Buffer if they like (albeit with the new simplifying NO_OVERWRITE limitation: the updates can't conflict with existing constant references that may be in flight on the GPU). The restriction disallowing partial Constant Buffer updates when Constant Buffers were added to D3D10 was intended to simplify the system handling of shader constants, on the assumption that applications could simply organize their constant data into groups, each with its own Constant Buffer, organized by frequency of update. The impression seems to be that in many cases this restriction was a net performance loss for applications, hence this proposed change to at least partially loosen up Constant Buffer updates.

    5.3.4.3.2 Offsetting Constant Buffer Bindings

    A common desire for high performance game engines is to collect a large batch of Constant Buffer updates for constants to be referenced by separate Draw*() calls, each needing their own constants, all at once. This is facilitated by allowing the application to create a large Buffer and then pointing individual shaders to regions within it (kind of like a View, but without having to make a whole object to describe the view).

    Constant Buffers are allowed to be created larger than the maximum Constant Buffer size that an individual shader can reference, which is at most 4096 16-byte elements (64KB). Each "element" is one 4-component Shader Constant.

    The Constant Buffer Resource size is limited only by the size of memory allocation the system is capable of handling (limits defined elsewhere, and more than large enough for the purpose of the discussion here).

    When a Constant Buffer larger than 4096 elements in size is bound to the pipeline via *SetShaderConstants() APIs [e.g. VSSetShaderConstants()], it appears to the shader as if it is only 4096 elements in size.

    Variants of the *SetShaderConstants() APIs, *SetShaderConstants1(), allow a "FirstConstant" and "NumConstants" to be specified along with the binding. When the shader accesses a Constant Buffer bound this way, it will appear as if it starts at the specified "FirstConstant" offset (in units of 16-byte constants) and has a size defined by NumConstants (number of 16-byte Constants). This is basically a lightweight "View" of a region of a larger Constant Buffer.

    FirstConstant must be a multiple of 16 constants.

    NumConstants must be a multiple of 16 constants, in the range [0..4096].

    If any part of the range defined by FirstConstant and NumConstants falls off the underlying resource, accesses to those addresses count as out-of-bounds reads from the shader, which are defined to return 0 for all components.
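
    A minimal sketch of an offset binding (in released headers the *SetShaderConstants1() variants appear as *SetConstantBuffers1(), e.g. VSSetConstantBuffers1(); pCtx1 and pBigCB are assumed):

    // Bind a 256-constant window starting at constant 2048 of a large Constant Buffer.
    UINT firstConstant = 2048; // multiple of 16, in units of 16-byte constants
    UINT numConstants  = 256;  // multiple of 16, at most 4096
    pCtx1->VSSetConstantBuffers1(0, 1, &pBigCB, &firstConstant, &numConstants);
    // The shader at slot b0 now sees the buffer as if it began at constant 2048
    // and were 256 constants long; reads past the underlying resource return 0.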

    This feature is required to be supported for all D3D10+ hardware in D3D11.1 drivers and is emulated by the runtime on Feature Level 9_x running on D3D9 drivers.

    5.3.4.4 Buffer Pipeline Binding: Shader Resource Input

    When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input and it is a typed Buffer (the view specifies a format type), it may be read from within shaders with the ld(22.4.6) instruction. See the description of this instruction for detail. To use a typed Buffer as a Shader Resource Input, it must be bound at one of the available 128 slots for input Resources, by first creating the appropriate View for this particular stage of the graphics pipeline. It is fine for the same Buffer to be bound to multiple slots simultaneously, possibly even with different Element formats or initial offsets. However, at each binding only a single Element type is permitted, and the data stride is implied to be equal to the Element size. In other words, "Array-of-structure" style layouts cannot be described for typed Buffers bound as Shader Resource Input. Structured Buffers allow array-of-structures access, though without any automatic format conversion for elements.

    Just like Typed Buffers, Raw and Structured Buffers can be bound to the pipeline via Shader Resource Views for reading into shaders via ld_raw(22.4.10) and ld_structured(22.4.12) instructions, respectively.
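
    As an illustration, creating a typed Buffer Shader Resource View at the API level (a sketch; pDevice and pBuffer, created with D3D11_BIND_SHADER_RESOURCE, are assumed):

    // A typed view exposing the Buffer as R32G32B32A32_FLOAT Elements. Only one
    // Element type per binding; the stride is implied by the Element size.
    D3D11_SHADER_RESOURCE_VIEW_DESC desc = {};
    desc.Format = DXGI_FORMAT_R32G32B32A32_FLOAT;
    desc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
    desc.Buffer.FirstElement = 0;
    desc.Buffer.NumElements  = 1024;
    ID3D11ShaderResourceView* pSRV = nullptr;
    HRESULT hr = pDevice->CreateShaderResourceView(pBuffer, &desc, &pSRV);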

    5.3.4.5 Pipeline Binding: Stream Output

    Details of the usage of such a Resource are described in the Streaming Output section(14). There are two types of bindings available for Stream Output Buffers: one treats a single output Buffer as a Multiple-Element Buffer (array-of-structures), while the other permits multiple output Buffers, each treated as a Single-Element Buffer (structure-of-arrays). Single-Element Buffer output is typically expected to be recirculated (subsequently) as a Shader Resource Input, but it can also be used as Input Assembler Vertex Input. Multiple-Element Buffer output is only intended to be recirculated (subsequently) back as Input Assembler Vertex Input (since Multiple-Element Buffer access is not currently available in Shaders).

    If the Resource has the Input Assembler Vertex Input Pipeline Bind flag specified, the Resource may also be used with DrawAuto(8.9).

    Since this is an output stage, Resources with this Pipeline Bind flag are not able to be mapped/ locked for CPU access ever. This doesn't prevent Resources completely from being viewed by the CPU, as there are performant(5.3.2) methods for viewing the contents of the Resource.

    5.3.4.6 Pipeline Binding: RenderTarget Output

    When a Buffer has been created with the Pipeline Bind flag indicating that it may be used as a RenderTarget Output, this Pipeline Bind flag indicates that Render Target Views may be created with this Resource.

    Constraints when a Buffer is used as RenderTarget output: it cannot be paired with any Depth/Stencil Output (i.e. no depth buffering); it can only have a single Element defined, with a data stride implied to be equal to the Element width; the View is limited to a maximum width of 16384 (multiple Views with different offsets would be needed to leverage the entire Buffer). In all other regards, a Buffer render target output is identical to the Texture1D case.
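
    A minimal sketch of a Buffer Render Target View honoring the width limit above (pDevice and pBuffer, created with D3D11_BIND_RENDER_TARGET, are assumed):

    D3D11_RENDER_TARGET_VIEW_DESC rtvDesc = {};
    rtvDesc.Format = DXGI_FORMAT_R32G32B32A32_FLOAT; // single Element type; stride == Element width
    rtvDesc.ViewDimension = D3D11_RTV_DIMENSION_BUFFER;
    rtvDesc.Buffer.FirstElement = 0;
    rtvDesc.Buffer.NumElements  = 16384; // maximum View width; a larger Buffer needs
                                         // multiple Views at different offsets
    ID3D11RenderTargetView* pRTV = nullptr;
    HRESULT hr = pDevice->CreateRenderTargetView(pBuffer, &rtvDesc, &pRTV);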

    Since this is an output stage, Resources with this Pipeline Bind flag are not able to be mapped/ locked for CPU access ever. This doesn't prevent Resources completely from being viewed by the CPU, as there are performant(5.3.2) methods for viewing the contents of the Resource.

    5.3.4.7 Pipeline Binding: Unordered Access

    When the Unordered Access Pipeline Bind has been indicated, Unordered Access Views may be created for use at the Compute Shader or Pixel Shader.

    5.3.5 Texture1D

    A Texture1D is a homogeneous array of 1D Textures. The array is homogeneous in the sense that each Texture has the same data format and dimensions (including miplevels). The entire array of Textures is created atomically. The memory for the entire Resource need not be contiguous. A Texture1D may not be created as Unstructured(5.1.2), but may be created as Prestructured+Typeless Memory(5.1.5) or as Prestructured+Typed Memory(5.1.6). As illustrated by the diagram(5) and binding configurations(5.3.1), a Texture1D may be decomposed into sub-groups of Mip Slices, Array Slices, and Subresources in order to refer to discrete components of the Resource to accomplish certain operations. The decomposition for graphics Pipeline Binding is achieved through the usage of Views for each stage of the pipeline.

    Like other Resources, a Texture1D must be qualified with a set of flags at creation indicating where in the graphics pipeline the Resource may be bound. Naturally, the Resource may be bound at more than one location in the pipeline, but the Resource must have been created with the restrictions that each Pipeline Bind flag indicates. Sometimes Pipeline Bind flags have restrictions which conflict with each other, so such Pipeline Bind flags are mutually exclusive.

    5.3.5.1 Pipeline Binding: Shader Resource Input

    When the Texture1D has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input, the Texture1D Resource may be read from within shaders with the ld(22.4.6) or sample(22.4.15) instructions, after they are bound to the pipeline through the usage of Views. See the descriptions of these instructions for details. Each Element from a Texture1D to be read into a Shader counts towards a limit on the total number of elements addressable from Resources (128). Texture1D Resources are addressed from the Shader with a 1D coordinate plus a 2nd coordinate specifying which Array Slice in the Texture1D to fetch from. The 2nd coordinate, if provided as floating point data, is rounded (nearest even), producing an integral array index. Typical 1D filtering occurs on the Array Slice chosen by the 2nd coordinate.

    5.3.5.2 Pipeline Binding: RenderTarget Output

    When a Texture1D Mip Slice is bound as a RenderTarget Output, through the usage of Views, it is allowable to use an accompanying Texture1D Depth/ Stencil of the same dimensions. For example, if the most detailed Mip Slice View of a Texture1D (width=6, arraysize=8) is bound as a RenderTarget Output, an effective Texture1D View of (width=6, arraysize=8) may be used as a Depth/ Stencil. Also, the particular Array Slice in the Texture1D to render to is chosen, from the Geometry Shader stage, by declaring a scalar component of output data as the System Interpreted Value "renderTargetArrayIndex". If such a value is not present in primitive data reaching the rasterizer, the default is to render to Array Slice 0.

    Rasterization to Texture1D resources is identical to rasterizing to a Texture2D resource with a y dimension of 1, thus both x and y coordinates are honored and only rendering that covers the Nx1 area of these resources will update them.

    Since this is an output stage, Resources with this Pipeline Bind flag are not able to be mapped/ locked for CPU access ever. This doesn't prevent Resources completely from being viewed by the CPU, as there are performant(5.3.2) methods for viewing the contents of the Resource.

    5.3.5.3 Pipeline Binding: Depth/ Stencil Output

    When the Texture1D has been created with the Pipeline Bind flag indicating that it may be used as a Depth/ Stencil Output, the Texture1D Resource may only be one of a few Resource Formats (essentially only those which have a 'D' component or those TYPELESS formats which can be converted to a format with a 'D' component), such as D32_FLOAT or R32_TYPELESS, etc.

    Resources created with this Pipeline Bind flag cannot also be used as a RenderTarget (the two flags are mutually exclusive).

    Since this is an output stage, Resources with this Pipeline Bind flag are not able to be mapped/ locked for CPU access ever. In addition, Depth/ Stencil Resources cannot be a destination for CopyResource(5.6.3), CopySubresourceRegion(5.6.2), nor UpdateSubresourceUP(5.6.8) operations. This doesn't prevent Resources completely from being viewed by the CPU, as there are performant(5.3.2) methods for viewing the contents of the Resource.

    5.3.6 Texture2D

    A Texture2D is a homogeneous array of 2D Textures. The array is homogeneous in the sense that each Texture has the same data format and dimensions (including miplevels). The entire array of Textures is created atomically. The memory for the entire Resource need not be contiguous. A Texture2D may not be created as Unstructured(5.1.2), but may be created as Prestructured+Typeless Memory(5.1.5) or as Prestructured+Typed Memory(5.1.6). As illustrated by the diagram(5) and binding configurations(5.3.1), a Texture2D may be decomposed into sub-groups of Mip Slices, Array Slices, and Subresources in order to refer to discrete components of the Resource to accomplish certain operations. The decomposition for graphics Pipeline Binding is achieved through the usage of Views for each stage of the pipeline.

    Like other Resources, a Texture2D must be qualified with a set of flags at creation indicating where in the graphics Pipeline the Resource may be bound. Naturally, the Resource may be bound at more than one location in the Pipeline, but the Resource must have been created with the restrictions that each Pipeline Bind flag indicates. Sometimes Pipeline Bind flags have restrictions which conflict with each other, so such Pipeline Bind flags are mutually exclusive.

    5.3.6.1 Pipeline Binding: Shader Resource Input

    When the Texture2D has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input, the Texture2D Resource may be read from within shaders with the ld(22.4.6) or sample(22.4.15) instructions, after they are bound to the pipeline through the usage of Views. See the descriptions of these instructions for details. Each Element from a Texture2D to be read into a Shader counts towards a limit on the total number of elements addressable from Resources (128). Texture2D Resources are addressed from the Shader with a 2D coordinate plus a 3rd coordinate specifying which Array Slice in the Texture2D to fetch from. The 3rd coordinate, if provided as floating point data, is rounded (nearest even), producing an integral array index. Typical 2D filtering occurs on the Array Slice chosen by the 3rd coordinate.

    5.3.6.2 Pipeline Binding: RenderTarget Output

    When a Texture2D Mip Slice View is bound as a RenderTarget Output, through the usage of Views, it is allowable to use an accompanying effective Texture2D Depth/ Stencil View of the same dimensions. For example, if the most detailed Mip Slice View of a Texture2D (width=6, height=4, arraysize=8) is bound as a RenderTarget Output, an effective Texture2D View of (width=6, height=4, arraysize=8) may be used as a Depth/ Stencil. Also, the particular Array Slice in the Texture2D to render to is chosen, from the Geometry Shader stage, by declaring a scalar component of output data as the System Interpreted Value "renderTargetArrayIndex". If such a value is not present in primitive data reaching the rasterizer, the default is to render to Array Slice 0.

    Since this is an output stage, Resources with this Pipeline Bind flag are not able to be mapped/ locked for CPU access ever. This doesn't prevent Resources completely from being viewed by the CPU, as there are performant(5.3.2) methods for viewing the contents of the Resource.

    5.3.6.3 Pipeline Binding: Depth/ Stencil Output

    When the Texture2D has been created with the Pipeline Bind flag indicating that it may be used as a Depth/ Stencil Output, the Texture2D Resource may only be one of a few Resource Formats (essentially only those which have a 'D' component or those TYPELESS formats which can be converted to a format with a 'D' component), such as D32_FLOAT or R32_TYPELESS, etc.

    Resources created with this Pipeline Bind flag cannot also be used as a RenderTarget (the two flags are mutually exclusive).

    Since this is an output stage, Resources with this Pipeline Bind flag are not able to be mapped/ locked for CPU access ever. In addition, Depth/ Stencil Resources cannot be a destination for CopyResource(5.6.3), CopySubresourceRegion(5.6.2), nor UpdateSubresourceUP(5.6.8) operations. This doesn't prevent Resources completely from being viewed by the CPU, as there are performant(5.3.2) methods for viewing the contents of the Resource.

    5.3.7 Texture3D

    A Texture3D is a 3D grid data layout, supporting mipmaps; and is also known as a Volume Texture. The entire Resource is created atomically. The memory for the entire Resource need not be contiguous. A Texture3D may not be created as Unstructured(5.1.2), but may be created as Prestructured+Typeless Memory(5.1.5) or as Prestructured+Typed Memory(5.1.6). As illustrated by the diagram(5) and binding configurations(5.3.1), a Texture3D may be decomposed into sub-groups of Mip Slices, Array Slices, and Subresources in order to refer to discrete components of the Resource to accomplish certain operations. The decomposition for graphics Pipeline Binding is achieved through the usage of Views for each stage of the pipeline.

    5.3.7.1 Pipeline Binding: Shader Resource Input

    When the Texture3D has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input, the Texture3D Resource may be read from within shaders with the ld(22.4.6) or sample(22.4.15) instructions, after they are bound to the pipeline through the usage of Views. See the descriptions of these instructions for details. Each Element from a Texture3D to be read into a Shader counts towards a limit on the total number of elements addressable from Resources (128). Texture3D Resources are addressed from the Shader with a 3D coordinate. Typical 3D filtering occurs with this coordinate.

    5.3.7.2 Pipeline Binding: RenderTarget Output

    When a Texture3D Mip Slice is bound as a RenderTarget Output, through the usage of Views, the Texture3D behaves identically to a Texture2D with n Array Slices, where n is the depth (3rd dimension) of the Texture3D. The particular z slice in the Texture3D to render to is chosen, from the Geometry Shader stage, by declaring a scalar component of output data as the System Interpreted Value "renderTargetArrayIndex". If such a value is not present in primitive data reaching the rasterizer, the default is to render to z=0.

    Since this is an output stage, Resources with this Pipeline Bind flag are not able to be mapped/ locked for CPU access ever. This doesn't prevent Resources completely from being viewed by the CPU, as there are performant(5.3.2) methods for viewing the contents of the Resource.

    5.3.8 TextureCube

    A TextureCube has 6 faces, each of which is like a square Texture2D, including mipmaps. The entire Resource is created atomically. The memory for the entire Resource need not be contiguous. A TextureCube may not be created as Unstructured(5.1.2), but may be created as Prestructured+Typeless Memory(5.1.5) or as Prestructured+Typed Memory(5.1.6). As illustrated by the diagram(5) and binding configurations(5.3.1), a TextureCube may be decomposed into sub-groups of Mip Slices, Array Slices (each representing a face), and Subresources in order to refer to discrete components of the Resource to accomplish certain operations. The decomposition for graphics Pipeline Binding is achieved through the usage of Views for each stage of the pipeline.

    TextureCubes can also represent an array of cubes, which means a multiple of 6 faces. Used as a Cube Array, the "array" dimension selects which Cube to use. However, the same resource can also be viewed as a 2D Array, in which case each face of each Cube appears as a single location along the "array" dimension.

    5.3.8.1 Pipeline Binding: Shader Resource Input

    When the TextureCube has been created with the Pipeline Bind flag indicating that it may be used as a Shader Resource Input, the TextureCube{Array} Resource may be read from within shaders after they are bound to the pipeline through the usage of Views. The View can expose the TextureCube{Array} as an array of TextureCubes starting from any face (from the perspective of a sequence of 2D faces), then spanning a multiple of 6 faces, such that each 6 faces appears as a location on the array axis. Alternatively, the TextureCube can be viewed as a 2D Array spanning any contiguous set of faces in the resource where each face is a slice, hiding the "Cube-ness" of the resource. Each Element from a TextureCube resource to be read into a Shader counts towards a limit on the total number of elements addressable from Resources (128). TextureCube Resources viewed as a Cube are addressed from the Shader with a 3D vector pointing out from the center of the TextureCube, and as a Cube Array, an additional coordinate provides the Array Slice. If the Array Slice is provided as a floating point number, it is rounded to nearest even.
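
    As an illustration of viewing a cube array at the API level (a sketch; pDevice and pCubeArray, a Texture2D array whose ArraySize is a multiple of 6, are assumed):

    // View cubes 1..2 of the array, i.e. faces 6..17 from the 2D-array perspective.
    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
    srvDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    srvDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURECUBEARRAY;
    srvDesc.TextureCubeArray.MostDetailedMip  = 0;
    srvDesc.TextureCubeArray.MipLevels        = (UINT)-1; // all remaining mip levels
    srvDesc.TextureCubeArray.First2DArrayFace = 6;        // may start from any face
    srvDesc.TextureCubeArray.NumCubes         = 2;        // spans faces 6..17
    ID3D11ShaderResourceView* pSRV = nullptr;
    HRESULT hr = pDevice->CreateShaderResourceView(pCubeArray, &srvDesc, &pSRV);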

    5.3.8.2 Pipeline Binding: RenderTarget Output

    When a TextureCube{Array} Mip Slice is bound as a RenderTarget Output, the TextureCube behaves identically to a Texture2DArray, such that any contiguous subset of the faces in the array participate in the View. The particular Array slice in the View to render to is chosen from the Geometry Shader stage, by declaring a scalar component of output data as the System Interpreted Value "renderTargetArrayIndex". If such a value is not present in primitive data reaching the rasterizer, the default is to render to Array Slice 0.

    Since this is an output stage, Resources with this Pipeline Bind flag are not able to be mapped/ locked for CPU access ever. This doesn't prevent Resources completely from being viewed by the CPU, as there are performant(5.3.2) methods for viewing the contents of the Resource.

    5.3.8.3 Pipeline Binding: Depth/ Stencil Output

    When the TextureCube{Array} has been created with the Pipeline Bind flag indicating that it may be used as a Depth/ Stencil Output, the Resource may only be one of a few Resource Formats (essentially only those which have a 'D' component or those TYPELESS formats which can be converted to a format with a 'D' component), such as D32_FLOAT or R32_TYPELESS, etc. In addition, when rendering using such a Depth/ Stencil TextureCube (viewed as a Texture2DArray Depth Stencil View), only equally sized RenderTarget Views are compatible for use as a RenderTarget Output.

    Resources created with this Pipeline Bind flag cannot also be used as a RenderTarget (the two flags are mutually exclusive).

    Since this is an output stage, Resources with this Pipeline Bind flag are not able to be mapped/ locked for CPU access ever. In addition, Depth/ Stencil Resources cannot be a destination for CopyResource(5.6.3), CopySubresourceRegion(5.6.2), nor UpdateSubresourceUP(5.6.8) operations. This doesn't prevent Resources completely from being viewed by the CPU, as there are performant(5.3.2) methods for viewing the contents of the Resource.

    typedef struct D3D10DDI_HSHADERRESOURCEVIEW
    {
        void* m_pDrvPrivate;
    } D3D10DDI_HSHADERRESOURCEVIEW;
    
    typedef struct D3D10DDIARG_BUFFER_SHADERRESOURCEVIEW
    {
        union
        {
            UINT FirstElement; // Nicer name // < ResourceWidth / ElementSize
            UINT ElementOffset;
        };
        union
        {
            UINT NumElements; // Nicer name // <= ( ResourceWidth / ElementSize - ElementOffset )
            UINT ElementWidth;
        };
    } D3D10DDIARG_BUFFER_SHADERRESOURCEVIEW;
    
    typedef struct D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW
    {
        union
        {
            UINT FirstElement;  // Nicer name   // < ResourceWidth / ElementSize
            UINT ElementOffset;
        };
        union
        {
            UINT NumElements;   // Nicer name // <= ( ResourceWidth / ElementSize - ElementOffset )
            UINT ElementWidth;
        };
        UINT     Flags; // See D3D11_DDI_BUFFEREX_SRV_FLAG_* below
    } D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW;
    #define D3D11_DDI_BUFFEREX_SRV_FLAG_RAW         0x00000001
    
    
    typedef struct D3D10DDIARG_TEX1D_SHADERRESOURCEVIEW
    {
        UINT     MostDetailedMip; // < Resource MipLevels
        UINT     FirstArraySlice; // < Resource ArraySize
        UINT     MipLevels; // <= ( Resource MipLevels - MostDetailedMip )
        UINT     ArraySize; // <= ( Resource ArraySize - FirstArraySlice )
    } D3D10DDIARG_TEX1D_SHADERRESOURCEVIEW;
    
    typedef struct D3D10DDIARG_TEX2D_SHADERRESOURCEVIEW
    {
        UINT     MostDetailedMip; // < Resource MipLevels
        UINT     FirstArraySlice; // < Resource ArraySize
        UINT     MipLevels; // <= ( Resource MipLevels - MostDetailedMip )
        UINT     ArraySize; // <= ( Resource ArraySize - FirstArraySlice )
    } D3D10DDIARG_TEX2D_SHADERRESOURCEVIEW;
    
    typedef struct D3D10DDIARG_TEX3D_SHADERRESOURCEVIEW
    {
        UINT     MostDetailedMip; // < Resource MipLevels
        UINT     MipLevels; // <= ( Resource MipLevels - MostDetailedMip )
    } D3D10DDIARG_TEX3D_SHADERRESOURCEVIEW;
    
    
    typedef struct D3D10DDIARG_TEXCUBE_SHADERRESOURCEVIEW
    {
        UINT     MostDetailedMip;
        UINT     MipLevels;
    } D3D10DDIARG_TEXCUBE_SHADERRESOURCEVIEW;
    
    typedef struct D3D10_1DDIARG_TEXCUBE_SHADERRESOURCEVIEW
    {
        UINT MostDetailedMip; // < Resource MipLevels
        UINT MipLevels; // <= ( Resource MipLevels - MostDetailedMip )
        UINT First2DArrayFace; // <= ( Resource ArraySize - 5 )
        UINT NumCubes; // multiple of 6 faces that must fit in resource after First2DArrayFace
    } D3D10_1DDIARG_TEXCUBE_SHADERRESOURCEVIEW;
    
    typedef struct D3D11DDIARG_CREATESHADERRESOURCEVIEW
    {
        D3D10DDI_HRESOURCE    hDrvResource;
        DXGI_FORMAT           Format; // Fully qualified
        D3D10DDIRESOURCE_TYPE ResourceDimension;
    
        union
        {
            D3D10DDIARG_BUFFER_SHADERRESOURCEVIEW    Buffer;
            D3D10DDIARG_TEX1D_SHADERRESOURCEVIEW     Tex1D;
            D3D10DDIARG_TEX2D_SHADERRESOURCEVIEW     Tex2D;
            D3D10DDIARG_TEX3D_SHADERRESOURCEVIEW     Tex3D;
            D3D10_1DDIARG_TEXCUBE_SHADERRESOURCEVIEW TexCube;
            D3D11DDIARG_BUFFEREX_SHADERRESOURCEVIEW  BufferEx;
        };
    } D3D11DDIARG_CREATESHADERRESOURCEVIEW;
    
        // part of user mode Device interface:
        STDMETHOD_( SIZE_T, CalcPrivateShaderResourceViewSize )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_CREATESHADERRESOURCEVIEW* pCreateShaderResourceView );
        STDMETHOD( CreateShaderResourceView )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_CREATESHADERRESOURCEVIEW* pCreateShaderResourceView,
            D3D10DDI_HSHADERRESOURCEVIEW hDrvShaderResourceView );
        STDMETHOD_( void, DestroyShaderResourceView )( D3D10DDI_HDEVICE hDrvDevice,
            D3D10DDI_HSHADERRESOURCEVIEW hDrvShaderResourceView );
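
    For illustration, a runtime-side sketch of filling the create arguments for a Texture2D view using the structures above (the handle names and function-table member name are placeholders, not defined by this spec):

    D3D11DDIARG_CREATESHADERRESOURCEVIEW createArgs;
    createArgs.hDrvResource = hResource;             // driver handle of the Texture2D
    createArgs.Format = DXGI_FORMAT_R8G8B8A8_UNORM;  // fully qualified, never TYPELESS
    createArgs.ResourceDimension = D3D10DDIRESOURCE_TEXTURE2D;
    createArgs.Tex2D.MostDetailedMip = 0;
    createArgs.Tex2D.MipLevels       = 1;  // <= ( Resource MipLevels - MostDetailedMip )
    createArgs.Tex2D.FirstArraySlice = 0;
    createArgs.Tex2D.ArraySize       = 1;  // <= ( Resource ArraySize - FirstArraySlice )
    SIZE_T size = pDeviceFuncs->pfnCalcPrivateShaderResourceViewSize(hDevice, &createArgs);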
    
    
    typedef struct D3D10DDI_HRENDERTARGETVIEW
    {
        void* m_pDrvPrivate;
    } D3D10DDI_HRENDERTARGETVIEW;
    
    typedef struct D3D10DDIARG_BUFFER_RENDERTARGETVIEW
    {
        union
        {
            UINT FirstElement; // Nicer name // < ResourceWidth / ElementSize
            UINT ElementOffset;
        };
        union
        {
            UINT NumElements; // Nicer name // <= ( ResourceWidth / ElementSize - ElementOffset )
            UINT ElementWidth;
        };
    } D3D10DDIARG_BUFFER_RENDERTARGETVIEW;
    
    typedef struct D3D10DDIARG_TEX1D_RENDERTARGETVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice; // < Resource ArraySize
        UINT     ArraySize; // <= ( Resource ArraySize - FirstArraySlice )
    } D3D10DDIARG_TEX1D_RENDERTARGETVIEW;
    
    typedef struct D3D10DDIARG_TEX2D_RENDERTARGETVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice; // < Resource ArraySize
        UINT     ArraySize; // <= ( Resource ArraySize - FirstArraySlice )
    } D3D10DDIARG_TEX2D_RENDERTARGETVIEW;
    
    typedef struct D3D10DDIARG_TEX3D_RENDERTARGETVIEW
    {
        UINT     MipSlice;
        UINT     FirstW; // < Resource MipSlice W dimension
        UINT     WSize; // <= ( Resource MipSlice W dimension - FirstW )
    } D3D10DDIARG_TEX3D_RENDERTARGETVIEW;
    
    typedef struct D3D10DDIARG_TEXCUBE_RENDERTARGETVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice; // as 2DArray
        UINT     ArraySize; // as 2DArray
    } D3D10DDIARG_TEXCUBE_RENDERTARGETVIEW;
    
    typedef struct D3D10DDIARG_CREATERENDERTARGETVIEW
    {
        D3D10DDI_HRESOURCE    hDrvResource;
        DXGI_FORMAT           Format; // Fully qualified
        D3D10DDIRESOURCE_TYPE ResourceDimension;
    
        union
        {
            D3D10DDIARG_BUFFER_RENDERTARGETVIEW  Buffer;
            D3D10DDIARG_TEX1D_RENDERTARGETVIEW   Tex1D;
            D3D10DDIARG_TEX2D_RENDERTARGETVIEW   Tex2D;
            D3D10DDIARG_TEX3D_RENDERTARGETVIEW   Tex3D;
            D3D10DDIARG_TEXCUBE_RENDERTARGETVIEW TexCube;
        };
    } D3D10DDIARG_CREATERENDERTARGETVIEW;
    
        // part of user mode Device interface:
        STDMETHOD_( SIZE_T, CalcPrivateRenderTargetViewSize )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D10DDIARG_CREATERENDERTARGETVIEW* pCreateRenderTargetView );
        STDMETHOD( CreateRenderTargetView )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D10DDIARG_CREATERENDERTARGETVIEW* pCreateRenderTargetView,
            D3D10DDI_HRENDERTARGETVIEW hDrvRenderTargetView );
        STDMETHOD_( void, DestroyRenderTargetView )( D3D10DDI_HDEVICE hDrvDevice,
            D3D10DDI_HRENDERTARGETVIEW hDrvRenderTargetView );
    
    typedef struct D3D10DDI_HDEPTHSTENCILVIEW
    {
        void* m_pDrvPrivate;
    } D3D10DDI_HDEPTHSTENCILVIEW;
    
    typedef struct D3D10DDIARG_TEX1D_DEPTHSTENCILVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice; // < Resource ArraySize
        UINT     ArraySize; // <= ( Resource ArraySize - FirstArraySlice )
    } D3D10DDIARG_TEX1D_DEPTHSTENCILVIEW;
    
    typedef struct D3D10DDIARG_TEX2D_DEPTHSTENCILVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice; // < Resource ArraySize
        UINT     ArraySize; // <= ( Resource ArraySize - FirstArraySlice )
    } D3D10DDIARG_TEX2D_DEPTHSTENCILVIEW;
    
    typedef struct D3D10DDIARG_TEXCUBE_DEPTHSTENCILVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice; // as 2DArray
        UINT     ArraySize; // as 2DArray
    } D3D10DDIARG_TEXCUBE_DEPTHSTENCILVIEW;
    
    typedef enum D3D11_DDI_CREATEDEPTHSTENCILVIEW_FLAG
    {
        D3D11_DDI_CREATE_DSV_READ_ONLY_DEPTH   = 0x01L,
        D3D11_DDI_CREATE_DSV_READ_ONLY_STENCIL = 0x02L,
        D3D11_DDI_CREATE_DSV_FLAG_MASK         = 0x03L,
    } D3D11_DDI_CREATEDEPTHSTENCILVIEW_FLAG;
    
    typedef struct D3D11DDIARG_CREATEDEPTHSTENCILVIEW
    {
        D3D10DDI_HRESOURCE    hDrvResource;
        DXGI_FORMAT           Format; // Fully qualified
        D3D10DDIRESOURCE_TYPE ResourceDimension;
        UINT                  Flags;
    
        union
        {
            D3D10DDIARG_TEX1D_DEPTHSTENCILVIEW   Tex1D;
            D3D10DDIARG_TEX2D_DEPTHSTENCILVIEW   Tex2D;
            D3D10DDIARG_TEXCUBE_DEPTHSTENCILVIEW TexCube;
        };
    } D3D11DDIARG_CREATEDEPTHSTENCILVIEW;
    
        // part of user mode Device interface:
        STDMETHOD_( SIZE_T, CalcPrivateDepthStencilViewSize )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_CREATEDEPTHSTENCILVIEW* pCreateDepthStencilView );
        STDMETHOD( CreateDepthStencilView )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_CREATEDEPTHSTENCILVIEW* pCreateDepthStencilView,
            D3D10DDI_HDEPTHSTENCILVIEW hDrvDepthStencilView );
        STDMETHOD_( void, DestroyDepthStencilView )( D3D10DDI_HDEVICE hDrvDevice,
            D3D10DDI_HDEPTHSTENCILVIEW hDrvDepthStencilView );
    
    typedef struct D3D11DDI_HUNORDEREDACCESSVIEW
    {
        void* m_pDrvPrivate;
    } D3D11DDI_HUNORDEREDACCESSVIEW;
    
    typedef struct D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW
    {
        UINT     FirstElement; // < ResourceWidth / ElementSize
        UINT     NumElements; // <= ( ResourceWidth / ElementSize - ElementOffset )
        UINT     Flags; // See D3D11_DDI_BUFFER_UAV_FLAG* below
    } D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW;
    #define D3D11_DDI_BUFFER_UAV_FLAG_RAW         0x00000001
    #define D3D11_DDI_BUFFER_UAV_FLAG_APPEND      0x00000002
    #define D3D11_DDI_BUFFER_UAV_FLAG_COUNTER     0x00000004
    
    typedef struct D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice; // < Resource ArraySize
        UINT     ArraySize; // <= ( Resource ArraySize - FirstArraySlice )
    } D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW;
    
    typedef struct D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice; // < Resource ArraySize
        UINT     ArraySize; // <= ( Resource ArraySize - FirstArraySlice )
    } D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW;
    
    typedef struct D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW
    {
        UINT     MipSlice;
        UINT     FirstW; // < Resource MipSlice W dimension
        UINT     WSize; // <= ( Resource MipSlice W dimension - FirstW )
    } D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW;
    
    typedef struct D3D11DDIARG_CREATEUNORDEREDACCESSVIEW
    {
        D3D10DDI_HRESOURCE    hDrvResource;
        DXGI_FORMAT           Format; // Fully qualified
        D3D10DDIRESOURCE_TYPE ResourceDimension; // Runtime will never set this to TexCube
    
        union
        {
            D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW    Buffer;
            D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW     Tex1D;
            D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW     Tex2D;
            D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW     Tex3D;
        };
    } D3D11DDIARG_CREATEUNORDEREDACCESSVIEW;
    
        // part of user mode Device interface:
        STDMETHOD_( SIZE_T, CalcPrivateUnorderedAccessViewSize )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_CREATEUNORDEREDACCESSVIEW* pCreateUnorderedAccessView );
        STDMETHOD( CreateUnorderedAccessView )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_CREATEUNORDEREDACCESSVIEW* pCreateUnorderedAccessView,
            D3D11DDI_HUNORDEREDACCESSVIEW hDrvUnorderedAccessView );
        STDMETHOD_( void, DestroyUnorderedAccessView )( D3D10DDI_HDEVICE hDrvDevice,
            D3D11DDI_HUNORDEREDACCESSVIEW hDrvUnorderedAccessView );
    
    

    5.3.9 Unordered Access Views

    Unordered Access Views (UAVs) can be bound at the Output Merger(17) (available to all graphics shader stages from there) and at the Compute Shader(18) stage.

    At the Output Merger, there is the constraint that the total of the number of o# slots (Render Target Views - RTVs) and u# slots (UAVs) that may be bound simultaneously is at most 64, where no more than 8 can be RTVs. The way this is enforced, for simplicity, is that all o# (RTV) slots that are declared must have a slot # that is less than the minimum # of the u# (UAV) slots that are declared. So it is valid for a Pixel Shader to declare o0, o1, u4 and u63, but it is not valid for a Pixel Shader to declare o0, u3, and o4.

    Separating o# from u# this way minimizes future dependence on the fact that they happen to live in the same bind space in D3D11, if that turns out not to be desirable.

    The UAVs bound at the Output Merger are visible to all graphics stages (a shared set of UAV bindings). So multiple graphics shader stages can access the same UAVs simultaneously.

    Certain shader stages, like the Vertex Shader or Domain Shader (with Tessellation), are implemented by hardware using shader result caches. So if nearby primitives share the same vertex, the results of the corresponding shader invocation for that vertex may be retrieved from a result cache rather than re-executing the shader. The presence of these result caches and their behavior is hardware specific. Previously, without the ability for the unique shader invocations to have side-effects, the user had no way of knowing or depending on any caching taking place, beyond observing some performance wins if the caching worked well. With UAVs available to all shaders (enabling shaders to write arbitrarily to the UAV memory), any hardware-specific shader result caching will be visible, and the burden is left to the application developer to avoid depending on any given hardware's behavior. In particular, the behavior of such caching would not take into account any UAV accesses that take place; the hash key for shader result caching is simply the inputs for a given shader invocation independent of what may be read from UAVs during the shader invocation (which may not occur at all if there is a cache hit).

    There is no guarantee that UAV accesses issued from within or across shader stages executing within a given Draw*(), or issued from the Compute Shader within Dispatch*(), finish in the order issued. All UAV accesses are finished at the end of the Draw*()/Dispatch*() though.

    The Compute Shader has its own separate set of 64 slots where only UAVs may be bound, independent of the set of RTV+UAV bindpoints for the graphics stages.

    In D3D11.0, the number of UAVs was limited to 8 at the Compute Shader and 8 combined RTV+UAV at the Pixel Shader. There have since been requests to increase this limit. In addition, there have been requests to have some sort of logging ability available to all shader stages, at least for debugging purposes. Being able to access UAVs from every graphics Shader Stage permits this.

    Dynamic indexing of UAV registers (i.e. dynamically indexing # in u#) is not permitted.

    Shader Instructions (defined elsewhere) which are accessing UAVs simply take a u# as a parameter, much like instructions that are sampling from textures take a t# as a parameter.

    5.3.9.1 Creating the Underlying Resource for a UAV

    The D3D11 Resource types that can have a UAV on them are Texture1D{Array}, Texture2D{Array}, Texture3D and Buffer. When the Resource is created at the API/DDI, the bind flag D3D11_{DDI_}BIND_UNORDERED_ACCESS must be specified in order for subsequent creation of UAVs on the resource to be valid.

    The D3D11_BIND_UNORDERED_ACCESS flag may be combined with any of the following bind flags:

    The D3D11_BIND_UNORDERED_ACCESS flag may NOT be combined with any of the following bind flags:

    The constraints on combining D3D11_BIND_UNORDERED_ACCESS with other flags on Resource Creation, such as Usage (dynamic, staging etc), are the same as the existing constraints specified for D3D11_BIND_RENDER_TARGET.

    The Sample Count on the resource must be 1, and the Sample Quality must be 0.

    Note in the DDI, the names above become D3D11_DDI_BIND_*.

    5.3.9.2 Creating an Unordered Access View (UAV) at the DDI

    typedef struct D3D11DDIARG_CREATEUNORDEREDACCESSVIEW
    {
        D3D10DDI_HRESOURCE    hDrvResource;
        DXGI_FORMAT           Format;
        D3D10DDIRESOURCE_TYPE ResourceDimension;
    
        union
        {
            D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW  Buffer;
            D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW   Tex1D;
            D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW   Tex2D;
            D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW   Tex3D;
        };
    } D3D11DDIARG_CREATEUNORDEREDACCESSVIEW;
    

    The Format parameter must be compatible with the format the Resource was created with, and can be any format that supports being bound as a RenderTarget, except for SRGB formats. Additional restrictions on the Format for Buffer views are discussed shortly below.

    The D3D11DDIARG_*_UNORDEREDACCESSVIEW parameters, describing the view parameters based on resource dimension, are as follows:

    typedef struct D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW
    {
        UINT     FirstElement;
        UINT     NumElements;
        UINT     Flags; // see D3D11_DDI_BUFFER_UAV_FLAG* below
    } D3D11DDIARG_BUFFER_UNORDEREDACCESSVIEW;
    #define D3D11_DDI_BUFFER_UAV_FLAG_RAW    0x00000001
    #define D3D11_DDI_BUFFER_UAV_FLAG_STRUCTURED 0x00000002
    

    The _RAW flag allows the shader to access the buffer simply as a 1D array of untyped 32-bit data. The Format must be specified as R32_TYPELESS when this flag is used. The underlying Buffer must have been created with D3D11_DDI_MISC_FLAG_ALLOW_RAW_VIEWS (D3D11_MISC_FLAG_ALLOW_RAW_VIEWS at the API).

    The _STRUCTURED flag (mutually exclusive with _RAW) requires that the Buffer was created as a Structured Buffer. The Format for a structured buffer must be specified as DXGI_FORMAT_UNKNOWN. The type information for the structured buffer will be inherited from the buffer resource.

    The absence of the _RAW and _STRUCTURED flags means the Buffer View is Typed, so the Format of the view can be specified as freely as with any other UAV dimension (1D, 2D, 3D).

    When a UAV or SRV is Raw, the FirstElement parameter (defining the start of the view) must result in a 128-bit aligned offset, otherwise the creation of the View will fail. Knowing the base address of a view is conveniently aligned enables various optimizations/assumptions in hardware, given accesses from a shader that are offsets from the base of the view (where the offsets are often literals in the shader).
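
    A minimal sketch of a Raw Buffer UAV at the API level, honoring the alignment rule above (pDevice and pRawBuffer, created with D3D11_BIND_UNORDERED_ACCESS and the raw-views misc flag, are assumed):

    // Elements of a Raw view are 32-bit, so a 128-bit aligned base means
    // FirstElement must be a multiple of 4.
    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format = DXGI_FORMAT_R32_TYPELESS;  // required for _RAW views
    uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.FirstElement = 4;            // 4 elements * 4 bytes = 128-bit aligned offset
    uavDesc.Buffer.NumElements  = 1020;
    uavDesc.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_RAW;
    ID3D11UnorderedAccessView* pUAV = nullptr;
    HRESULT hr = pDevice->CreateUnorderedAccessView(pRawBuffer, &uavDesc, &pUAV);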

    typedef struct D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice;
        UINT     ArraySize;
    } D3D11DDIARG_TEX1D_UNORDEREDACCESSVIEW;
    
    typedef struct D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW
    {
        UINT     MipSlice;
        UINT     FirstArraySlice;
        UINT     ArraySize;
    } D3D11DDIARG_TEX2D_UNORDEREDACCESSVIEW;
    
    typedef struct D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW
    {
        UINT     MipSlice;
        UINT     FirstW;
        UINT     WSize;
    } D3D11DDIARG_TEX3D_UNORDEREDACCESSVIEW;
    

    5.3.9.3 Binding an Unordered Access View at the DDI

    The D3D11 OMSetRenderTargets API/DDI accepts RenderTargetViews, a DepthStencilView, and UnorderedAccessViews at the same time. This affects the Graphics side of the pipeline, not the Compute side. Here is the DDI:

    typedef VOID ( APIENTRY* PFND3D11DDI_SETRENDERTARGETS )(
        D3D10DDI_HDEVICE, // device handle
        CONST D3D11DDI_HRENDERTARGETVIEW*, // array of RenderTargetViews,
        UINT, // index of first RTV to set
        UINT, // number of RTVs being set (all others unbound)
        D3D10DDI_HDEPTHSTENCILVIEW, // DepthStencilView
        CONST D3D11DDI_HUNORDEREDACCESSVIEW*, // array of UnorderedAccessViews,
        UINT*, // Array of Append buffer offsets (relevant only for
               // UAVs which have the Append flag (otherwise ignored).
               // -1 means keep current offset.  Any other value sets
               // the hidden counter for that Appendable UAV.
        UINT, // index of the first UAV slot to set
        UINT, // number of UAVs being set (all others unbound)
        UINT, // the first UAV in the set of updated UAVs (including NULL bindings)
        UINT // the number of UAVs in the set of updated UAVs (including NULL bindings)
     );
    

    There is a separate CSSetUnorderedAccessViews API/DDI that accepts UnorderedAccessViews to be bound for the Compute side of the device. It is similar to the above, except it doesn’t include RenderTargets.

    The last two parameters, UAVRangeStart and UAVRangeSize, exist at the DDI level and not at the OMSetRenderTargets API level. The Direct3D 11 runtime tracks the set of bound UAVs which have changed (which may be different from the set of bound UAVs overall), and the driver may use this information for optimization purposes.
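
    A minimal API-level sketch (the released API name is OMSetRenderTargetsAndUnorderedAccessViews; pCtx, pRTV, pDSV, and pUAV are assumed):

    // One RTV in slot 0 plus one UAV in slot 1, honoring the rule that all RTV
    // slot #s must be less than the minimum UAV slot #.
    ID3D11UnorderedAccessView* uavs[1] = { pUAV };
    UINT initialCounts[1] = { (UINT)-1 };  // -1: keep the UAV's current hidden counter
    pCtx->OMSetRenderTargetsAndUnorderedAccessViews(
        1, &pRTV, pDSV,   // RTVs occupy slot 0
        1, 1, uavs,       // UAVs start at slot 1
        initialCounts);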

    5.3.9.4 Hazard Tracking

    UAVs have the same precedence in Hazard Tracking as RTVs and SO Targets:

    If a subresource is ever bound as an output (RTV/UAV/SO Target), subsequently unbound, and then bound as a shader input, a ReadAfterWriteHazard DDI is called. Drivers can use this as a hint as to when a rendering flush may be required. There are additional situations where Read After Write hazards are reported given the two pipelines – Graphics and Compute – in particular resources moving from output binding on one side to input binding on the other side, as well as Compute outputs moving to Compute inputs. Note UAVs are considered "output", since if an application only needs to read a resource, it should be bound as an input instead.

    5.3.9.5 Limitations on Typed UAVs

    There is a significant and unfortunate limitation in many hardware designs that had to be built into D3D. While Typed UAVs support many formats – essentially any format that can be a RenderTarget – the majority of these formats only support being written as a UAV, but not read at the same time.

    Shader Resource Views are of course always available in any shader stage when only read-only access from arbitrary locations in a Typed resource is needed. Conversely, it is useful that if write-only access to arbitrary locations in a Typed resource is needed, UAVs support that scenario.

    However, simultaneous reading and writing to a UAV within a single Draw* or Dispatch* operation is only supported if the UAV’s Type is R32_UINT/_SINT/_FLOAT. In particular, the ld_uav_typed IL instruction for reading from a typed UAV is limited to R32_UINT/_SINT/_FLOAT formats. E.g. a UAV with a type such as R8G8B8A8_UNORM_SRGB cannot be read from (but it can be written).

    D3D has a partial workaround for this inability to simultaneously read+write from Typed UAVs. The purpose is to make tasks such as editing an image in-place simpler, given the circumstances.

    D3D allows Texture1D/2D/3D resources created with any of the following small set of 32-bit per element formats to have UAVs created from them with R32_UINT/_SINT/_FLOAT as the type:

    Once an R32_* UAV is created, it allows arbitrary reading and writing to the UAV’s memory in-place. The catch is there is no type conversion since the format is R32_*, meaning reads and writes simply move raw data unaltered between a shader and memory. Since the desire of the application is that the memory is really interpreted as some format like DXGI_FORMAT_R8G8B8A8_UNORM_SRGB, the application is responsible for manually performing type conversion in the shader code upon reads and writes to the R32_* UAV.

    The upside is that because the original resource was created with one of the _TYPELESS formats listed above, other views such as Shader Resource Views or Render Target Views can be created using the actual format that the application intended – such as DXGI_FORMAT_R8G8B8A8_UNORM_SRGB. These properly typed views can then benefit from fixed-function hardware type conversion (texture filtering on reads, blending on writes), even though these were not available to the UAV, where manual type conversion code had to be done in the shader.

    The formats supporting this casting to R32_* are limited to those for which the hardware really makes no difference in memory layout versus R32_*, excluding a few that have complex encoding cost such as DXGI_FORMAT_R11G11B10_FLOAT. If this ability to cast to R32_* UAVs was not included in D3D, applications would have to perform a copy rendering pass to move data from an R32_* resource where the image editing occurred to a separate resource that has the desired type (e.g. R10G10B10A2_UNORM), which is a waste of memory.
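
    A minimal sketch of the casting pattern described above (pDevice is assumed; sizes are placeholders):

    // Create the resource TYPELESS, give the UAV an R32_UINT type for in-place
    // read+write (manual conversion in the shader), and give the SRV the intended
    // format so other reads get fixed-function conversion and filtering.
    D3D11_TEXTURE2D_DESC texDesc = {};
    texDesc.Width = 256; texDesc.Height = 256;
    texDesc.MipLevels = 1; texDesc.ArraySize = 1;
    texDesc.Format = DXGI_FORMAT_R8G8B8A8_TYPELESS; // 32 bits per element
    texDesc.SampleDesc.Count = 1;                   // UAV resources require Count 1
    texDesc.Usage = D3D11_USAGE_DEFAULT;
    texDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
    ID3D11Texture2D* pTex = nullptr;
    pDevice->CreateTexture2D(&texDesc, nullptr, &pTex);

    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format = DXGI_FORMAT_R32_UINT;          // raw 32-bit data; no conversion
    uavDesc.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE2D;
    ID3D11UnorderedAccessView* pUAV = nullptr;
    pDevice->CreateUnorderedAccessView(pTex, &uavDesc, &pUAV);

    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
    srvDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM_SRGB; // the intended type
    srvDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
    srvDesc.Texture2D.MipLevels = 1;
    ID3D11ShaderResourceView* pSRV = nullptr;
    pDevice->CreateShaderResourceView(pTex, &srvDesc, &pSRV);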

    5.3.10 Unordered Count and Append Buffers

    Unordered Append Buffers enable a usage pattern whereby Pixel Shaders and Compute Shaders can write structures of data to memory in variable quantity, in an unordered way. Hardware can take advantage of knowing that this type of operation is occurring to optimize performance.

    5.3.10.1 Creating Unordered Count and Append Buffers

    For Structured Buffers that have been created with the Bind flag: D3D11_DDI_BIND_UNORDERED_ACCESS, Unordered Access Views can be created with one of the optional flags D3D11_DDI_BUFFER_UAV_FLAG_COUNTER or D3D11_DDI_BUFFER_UAV_FLAG_APPEND. The latter flag gives up some flexibility for (possibly) performance – described later.

    Creating a Structured Buffer UAV with UAV_FLAG_COUNTER causes the driver to allocate storage for a single hidden 32-bit unsigned integer counter associated with the UAV (as opposed to being associated with the underlying resource), initialized to 0. Multiple UAVs created on the same Buffer with this flag will thus have multiple independent counters.

    Shaders can atomically increment or decrement this count (but not do both in one shader) and use the returned index to indicate which structure index in the UAV to access. If the _COUNTER flag is used, count values (representing struct index) returned to the shader may be saved for use later after the shader has completed, for example for linked lists.

    If the _APPEND flag is used when creating the UAV, a counter is created like with the _COUNTER flag, except the counter values returned to a shader invocation when incrementing or decrementing the count are only valid for the lifetime of the shader invocation. So the shader can use the index during the shader invocation to access the corresponding struct index in the UAV, but the hardware is permitted to reorder the struct layout from the point of view of anything outside the shader invocation, or after the shader invocation is complete. This is for cases where an application is simply generating struct records and it does not care that the order of the records is maintained. However if the application goes out of its way to examine the buffer (such as copying from it or using some other type of View) the hardware will have to pack the records into the range of struct locations corresponding to the number of times shader invocations incremented the counter on a given UAV. Even though the data will appear packed, the structs may be reordered. Some hardware will take advantage of not having to maintain the order to provide better access performance.
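
    For concreteness, a minimal sketch of creating a Structured Buffer and an Append UAV over it at the D3D11 API level (the API flag names mirror the DDI flags above; the struct and sizes are illustrative):

    // Minimal sketch: a structured buffer with an Append UAV (D3D11 API level).
    // Error handling omitted; 'device' is assumed to be an ID3D11Device*.
    struct Particle { float pos[3]; float age; };    // illustrative 16-byte struct

    D3D11_BUFFER_DESC bd = {};
    bd.ByteWidth           = sizeof(Particle) * 4096;
    bd.Usage               = D3D11_USAGE_DEFAULT;
    bd.BindFlags           = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
    bd.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    bd.StructureByteStride = sizeof(Particle);

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer( &bd, nullptr, &buffer );

    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format              = DXGI_FORMAT_UNKNOWN;          // required for structured buffers
    uavDesc.ViewDimension       = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.FirstElement = 0;
    uavDesc.Buffer.NumElements  = 4096;
    uavDesc.Buffer.Flags        = D3D11_BUFFER_UAV_FLAG_APPEND; // or _COUNTER

    ID3D11UnorderedAccessView* appendUAV = nullptr;
    device->CreateUnorderedAccessView( buffer, &uavDesc, &appendUAV );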

    5.3.10.2 Using Unordered Count and Append Buffers

    When Pixel Shaders and Compute Shaders bind UAVs that have _COUNTER or _APPEND usage specified, an initial value for the View’s hidden counter must be provided as part of the bind call. Specifying -1 means maintain the current counter value already associated with the View. Any other value sets the counter value.
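
    At the API level this looks like the following minimal sketch (assuming the appendUAV from the previous sketch and an ID3D11DeviceContext* named context):

    // Minimal sketch: binding an Append/Counter UAV with an initial count.
    // Passing 0 resets the hidden counter; (UINT)-1 preserves its current value.
    UINT initialCount = 0;
    context->CSSetUnorderedAccessViews( 0, 1, &appendUAV, &initialCount );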

    When an Append UAV is bound to the pipeline, the instructions that can access it are restricted to the following:

    imm_atomic_alloc(22.17.17)
    store_structured(22.4.13)
    imm_atomic_consume(22.17.18)
    ld_structured(22.4.12)

    For an Append UAV, the HLSL compiler can use imm_atomic_alloc to obtain an "address" and then use a sequence of store_* commands to write out data to a unique location in the unordered output UAV.

    Conversely, the HLSL compiler can use imm_atomic_consume to obtain an "address" that already has data and then use a sequence of ld_* commands to read back data from a unique location in the UAV.

    For Append UAVs, the count values returned by imm_atomic_alloc and imm_atomic_consume are hidden from the shader by the HLSL compiler, which exposes simply the ability to Append() structs or Consume() structs (not both in the same shader).

    For Count UAVs, where the returned count value may be stored, any instructions capable of accessing Structured Buffers are permitted from the shader, in addition to all of the instructions listed above. Unlike Append UAVs, the HLSL compiler exposes the count values returned by imm_atomic_alloc and imm_atomic_consume for access in the shader – allowing the value to be saved.

    The counter behind imm_atomic_alloc and imm_atomic_consume has no overflow or underflow clamping, and there is no feedback given to the shader as to whether overflow/underflow happened (wrapping of the counter). The only thing the counter really accomplishes is a way of generating unique addresses that is conveniently bundled with the UAV.

    It is invalid for a single shader, or multiple shaders in flight on a GPU, to mix imm_atomic_alloc and imm_atomic_consume instructions operating on the same UAV. For a single shader, compilation fails if these operations (however they appear in HLSL) are mixed. The GPU must guarantee that Shader invocations from separate Draw*/Dispatch operations do not run out of sequence when there is a possibility that an alloc/consume hazard could exist.

    The counter associated with a Count/Append UAV is somewhat like the counters that are associated with Stream Output buffers (note a Buffer cannot be both a Stream Output and Count/Append Buffer), although those counters have slightly different semantics. There is an API/DDI CopyStructureCount which allows the hidden count in a Count/Append UAV to be copied to another Buffer. This can serve as the vertex count parameter to Draw*InstancedIndirect, allowing data that has been written to an Append Buffer to be recirculated back into the GPU without CPU knowledge of the exact quantity involved.
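
    A minimal sketch of that recirculation at the API level (assuming an args Buffer named argsBuffer created with the DRAWINDIRECT_ARGS misc flag, plus the appendUAV and context from the earlier sketches):

    // Minimal sketch: recirculating an Append buffer's record count back into
    // the GPU as the instance count of an indirect draw. 'argsBuffer' must be
    // large enough to hold the four DrawInstanced arguments.
    UINT args[4] = { 4,    // VertexCountPerInstance (e.g. a quad)
                     0,    // InstanceCount - filled in by CopyStructureCount below
                     0,    // StartVertexLocation
                     0 };  // StartInstanceLocation
    context->UpdateSubresource( argsBuffer, 0, nullptr, args, 0, 0 );

    // Copy the UAV's hidden counter into the InstanceCount slot (byte offset 4).
    context->CopyStructureCount( argsBuffer, 4, appendUAV );

    context->DrawInstancedIndirect( argsBuffer, 0 );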

    When Append/Count UAVs are bound to the pipeline the application can specify what the initial counter value should be, or choose to maintain the existing count value.

    For an Append UAV, since the storage is unordered, when binding the UAV to the pipeline as a UAV or any other type of view (e.g. SRV), the contents of any struct entries in the UAV beyond the count value become undefined, and any contents within the count value are maintained, but may be reordered. It is fine for multiple different types of views to overlap, but the application has to beware of the effect that the unordered nature of Append UAVs may have (when bound/used) on other overlapping views of the same memory. It is safest for an application not to mix usage of overlapping UAVs with expectations of data order being maintained in between.

    Count UAVs do not create any such ordering issues, since by definition applications are allowed to save count values as references to specific locations in the UAV.

    For some implementations, Append UAVs will behave identically to Count UAVs (e.g. no reordering). Still, if the application does not care about the ordering of records being maintained in the UAV, it does not hurt (and can only help on some implementations) to make use of the constrained Append semantics for generating and subsequently consuming unordered collections of items.

    5.3.11 Video Views

    As of the D3D11.1 API/DDI, Video Resources can have SRV/RTV/UAVs created so that D3D shaders can process them. The way the underlying Video Resource shows up in D3D as an ID3D11Resource* is described in separate D3D11 Video specs. This section covers how given an ID3D11Resource* to a Video Resource, SRV/RTV/UAVs can be created in D3D.

    These Video Resources will be either Texture2D or Texture2DArray, so the ViewDimension in the VIEW_DESC structure must match. Additionally, the format of the underlying Video Resource restricts the formats that the View can use.

    The following table describes all the combinations of Video Resource and View(s) that can be made from them. Note that multiple views of different parts of the same surface can be created, and depending on the format they may have different sizes from each other. A few video formats do not support D3D SRV/UAV/RTVs at all: DXGI_FORMAT_420_OPAQUE, _AI44, _IA44, _P8 and _A8P8. Further details on all the video formats are provided in the D3D11 Video DDI spec.

    Runtime read+write conflict prevention logic (which stops a resource from being bound as an SRV and RTV/UAV at the same time) treats Views of different parts of the same Video surface as conflicting for simplicity. It doesn’t seem interesting to allow the case of reading from luma while simultaneously rendering to chroma in the same surface, for example, even though it may be possible in hardware.

    Video Resource Format: AYUV (the most common YUV 4:4:4 format)
        Valid View Format (DXGI_FORMAT_*): R8G8B8A8_{UNORM|UINT}; for UAVs, R32_UINT is an additional choice
        Meaning: Straightforward mapping of the entire surface in one view. Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other formats).
        Mapping to View Channel: V->R8, U->G8, Y->B8, A->A8
        View Types Supported: SRV, RTV, UAV

    Video Resource Format: YUY2 (the most common YUV 4:2:2 format)
        Valid View Format: R8G8B8A8_{UNORM|UINT}; for UAVs, R32_UINT is an additional choice
        Meaning: Straightforward mapping of the entire surface in one view. Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other formats).
        Mapping to View Channel: Y0->R8, U0->G8, Y1->B8, V0->A8
        View Types Supported: SRV, UAV

        Valid View Format: R8G8_B8G8_UNORM
        Meaning: In this case the width of the view will appear to be twice what the R8G8B8A8 view would be, with hardware reconstruction of RGBA done automatically on read (and before filtering). This has been in D3D hardware for a long time (legacy), though it is likely no longer interesting.
        Mapping to View Channel: Y0->R8, U0->G8[0], Y1->B8, V0->G8[1]
        View Types Supported: SRV

    Video Resource Format: NV12 (the most common YUV 4:2:0 format)
        Valid View Format: R8_{UNORM|UINT}
        Meaning: Luminance Data View
        Mapping to View Channel: Y->R8
        View Types Supported: SRV, RTV, UAV

        Valid View Format: R8G8_{UNORM|UINT}
        Meaning: Chrominance Data View (width and height are each 1/2 of luminance view)
        Mapping to View Channel: U->R8, V->G8
        View Types Supported: SRV, RTV, UAV

    Video Resource Format: NV11 (the most common YUV 4:1:1 format)
        Valid View Format: R8_{UNORM|UINT}
        Meaning: Luminance Data View
        Mapping to View Channel: Y->R8
        View Types Supported: SRV, RTV, UAV

        Valid View Format: R8G8_{UNORM|UINT}
        Meaning: Chrominance Data View (width and height are each 1/4 of luminance view)
        Mapping to View Channel: U->R8, V->G8
        View Types Supported: SRV, RTV, UAV

    Video Resource Format: P016 (a 16 bit per channel planar 4:2:0 format)
        Valid View Format: R16_{UNORM|UINT}
        Meaning: Luminance Data View
        Mapping to View Channel: Y->R16
        View Types Supported: SRV, RTV, UAV

        Valid View Format: R16G16_{UNORM|UINT}; for UAVs, R32_UINT is an additional choice
        Meaning: Chrominance Data View (width and height are each 1/2 of luminance view). Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other formats).
        Mapping to View Channel: U->R16, V->G16
        View Types Supported: SRV, RTV, UAV

    Video Resource Format: P010 (a 10 bit per channel planar 4:2:0 format)
        Valid View Format: R16_{UNORM|UINT}
        Meaning: Luminance Data View. D3D does not enforce or care whether or not the lowest 6 bits are 0 (given this is a 10 bit format using 16 bits) – application shader code would have to enforce this manually if desired. From the D3D point of view, this format is no different from P016.
        Mapping to View Channel: Y->R16
        View Types Supported: SRV, RTV, UAV

        Valid View Format: R16G16_{UNORM|UINT}; for UAVs, R32_UINT is an additional choice
        Meaning: Chrominance Data View (width and height are each 1/2 of luminance view). Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other formats). Same comment as above about this 10 bit format using 16 bits.
        Mapping to View Channel: U->R16, V->G16
        View Types Supported: SRV, RTV, UAV

    Video Resource Format: Y216 (a 16 bit per channel packed 4:2:2 format)
        Valid View Format: R16G16B16A16_{UNORM|UINT}
        Meaning: Straightforward mapping of the entire surface in one view.
        Mapping to View Channel: Y0->R16, U->G16, Y1->B16, V->A16
        View Types Supported: SRV, UAV

    Video Resource Format: Y210 (a 10 bit per channel packed 4:2:2 format)
        Valid View Format: R16G16B16A16_{UNORM|UINT}
        Meaning: Straightforward mapping of the entire surface in one view. D3D does not enforce or care whether or not the lowest 6 bits are 0 (given this is a 10 bit format using 16 bits) – application shader code would have to enforce this manually if desired. From the D3D point of view, this format is no different from Y216.
        Mapping to View Channel: Y0->R16, U->G16, Y1->B16, V->A16
        View Types Supported: SRV, UAV

    Video Resource Format: Y416 (a 16 bit per channel packed 4:4:4 format)
        Valid View Format: R16G16B16A16_{UNORM|UINT}
        Meaning: Straightforward mapping of the entire surface in one view.
        Mapping to View Channel: U->R16, Y->G16, V->B16, A->A16
        View Types Supported: SRV, UAV

    Video Resource Format: Y410 (a 10 bit per channel packed 4:4:4 format)
        Valid View Format: R10G10B10A2_{UNORM|UINT}; for UAVs, R32_UINT is an additional choice
        Meaning: Straightforward mapping of the entire surface in one view. Using R32_UINT for UAVs allows both read and write (as opposed to just write for the other formats).
        Mapping to View Channel: U->R10, Y->G10, V->B10, A->A2
        View Types Supported: SRV, UAV


    5.4 Resource Creation


    5.4.1 Overview

    Resources have a number of properties in common that are specified at Resource creation, reflected in the creation structures below.

    Resources are made up of one or more Subresources. These Subresources share a common lifespan with each other and the Resource; in other words, the Resource and Subresources are atomically allocated and destroyed. However, some operations occur at the Subresource level, rather than the Resource level. Subresources are three dimensional entities (with height, width, depth, pitch, and slice pitch), but degenerate into two and one dimensional entities for certain Resource types. For example, a fully mipped Texture2D Resource created with a width of two, a height of two, and an array size of two will have four Subresources that can be individually referenced for certain operations. Two of the Subresources have a width of two, height of two, and depth of one; these are the most detailed mip level. The other two Subresources have a width of one, height of one, and depth of one. Each Subresource is allowed to have its own address, so the Resource may have somewhere between one and four disjoint allocations to satisfy the previous example. Each Subresource inherits the properties of the Resource, and Subresources may not be part of multiple Resources.

    typedef enum D3D10DDIRESOURCE_TYPE
    {
        D3D10DDIRESOURCE_BUFFER      = 1,
        D3D10DDIRESOURCE_TEXTURE1D   = 2,
        D3D10DDIRESOURCE_TEXTURE2D   = 3,
        D3D10DDIRESOURCE_TEXTURE3D   = 4,
        D3D10DDIRESOURCE_TEXTURECUBE = 5,
    #if D3D11DDI_MINOR_HEADER_VERSION >= 1
        D3D11DDIRESOURCE_BUFFEREX    = 6,
    #endif
    } D3D10DDIRESOURCE_TYPE;
    
    typedef struct D3D10DDI_MIPINFO
    {
        UINT TexelWidth;
        UINT TexelHeight;
        UINT TexelDepth;
        UINT PhysicalWidth;
        UINT PhysicalHeight;
        UINT PhysicalDepth;
    } D3D10DDI_MIPINFO;
    
    typedef struct D3D10_DDIARG_SUBRESOURCE_UP
    {
        VOID*   pSysMem;
        UINT  SysMemPitch;
        UINT  SysMemSlicePitch;
    } D3D10_DDIARG_SUBRESOURCE_UP;
    
    typedef struct D3D11DDI_HRESOURCE
    {
        void* m_pDrvPrivate;
    } D3D11DDI_HRESOURCE;
    
    // Bits for D3D11DDI_CREATERESOURCE::BindFlags
    
    typedef enum D3D10_DDI_RESOURCE_BIND_FLAG
    {
        D3D10_DDI_BIND_VERTEX_BUFFER     = 0x00000001L,
        D3D10_DDI_BIND_INDEX_BUFFER      = 0x00000002L,
        D3D10_DDI_BIND_CONSTANT_BUFFER   = 0x00000004L,
        D3D10_DDI_BIND_SHADER_RESOURCE   = 0x00000008L,
        D3D10_DDI_BIND_STREAM_OUTPUT     = 0x00000010L,
        D3D10_DDI_BIND_RENDER_TARGET     = 0x00000020L,
        D3D10_DDI_BIND_DEPTH_STENCIL     = 0x00000040L,
        D3D10_DDI_BIND_PIPELINE_MASK     = 0x0000007FL,
    
        D3D10_DDI_BIND_PRESENT           = 0x00000080L,
        D3D10_DDI_BIND_MASK              = 0x000000FFL,
    
    #if D3D11DDI_MINOR_HEADER_VERSION >= 1
        D3D11_DDI_BIND_UNORDERED_ACCESS  = 0x00000100L,
    
        D3D11_DDI_BIND_PIPELINE_MASK     = 0x0000017FL,
        D3D11_DDI_BIND_MASK              = 0x000001FFL,
    #endif
    } D3D10_DDI_RESOURCE_BIND_FLAG;
    
    // Bits for D3D11DDI_CREATERESOURCE::MapFlags
    typedef enum D3D10_DDI_CPU_ACCESS
    {
        D3D10_DDI_CPU_ACCESS_WRITE          = 0x00000001L,
        D3D10_DDI_CPU_ACCESS_READ           = 0x00000002L,
        D3D10_DDI_CPU_ACCESS_MASK          = 0x00000003L,
    } D3D10_DDI_CPU_ACCESS;
    
    // Bits for D3D11DDI_CREATERESOURCE::Usage
    typedef enum D3D10_DDI_RESOURCE_USAGE
    {
        D3D10_DDI_USAGE_DEFAULT    = 0,
        D3D10_DDI_USAGE_IMMUTABLE  = 1,
        D3D10_DDI_USAGE_DYNAMIC    = 2,
        D3D10_DDI_USAGE_STAGING    = 3,
    } D3D10_DDI_RESOURCE_USAGE;
    
    // Bits for D3D11DDI_CREATERESOURCE::MiscFlags
    typedef enum D3D10_DDI_RESOURCE_MISC_FLAG
    {
        D3D10_DDI_RESOURCE_AUTO_GEN_MIP_MAP             = 0x00000001L,
        D3D10_DDI_RESOURCE_MISC_SHARED                  = 0x00000002L,
        // Reserved for D3D11_RESOURCE_MISC_TEXTURECUBE   0x00000004L,
        D3D10_DDI_RESOURCE_MISC_DISCARD_ON_PRESENT      = 0x00000008L,
    #if D3D11DDI_MINOR_HEADER_VERSION >= 1
        D3D11_DDI_RESOURCE_MISC_DRAWINDIRECT_ARGS       = 0x00000010L,
        D3D11_DDI_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS  = 0x00000020L,
        D3D11_DDI_RESOURCE_MISC_BUFFER_STRUCTURED       = 0x00000040L,
        D3D11_DDI_RESOURCE_MISC_RESOURCE_CLAMP          = 0x00000080L,
    #endif
        // Reserved for D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX 0x00000100L,
        // Reserved for D3D11_RESOURCE_MISC_GDI_COMPATIBLE 0x00000200L,
        D3D10_DDI_RESOURCE_MISC_REMOTE                  = 0x00000400L,
    } D3D10_DDI_RESOURCE_MISC_FLAG;
    
    typedef struct D3D11DDIARG_CREATERESOURCE
    {
        CONST D3D10DDI_MIPINFO*              pMipInfoList;
        CONST D3D10_DDIARG_SUBRESOURCE_UP*   pInitialDataUP; // non-NULL if Usage has invariant
        D3D10DDIRESOURCE_TYPE                ResourceDimension; // Part of old Caps1
    
        UINT                                 Usage; // Part of old Caps1
        UINT                                 BindFlags; // Part of old Caps1
        UINT                                 MapFlags;
        UINT                                 MiscFlags;
    
        DXGI_FORMAT                          Format; // Totally different than D3DDDIFORMAT
        DXGI_SAMPLE_DESC                     SampleDesc;
        UINT                                 MipLevels;
        UINT                                 ArraySize;
    
        // Can only be non-NULL, if BindFlags has D3D10_DDI_BIND_PRESENT bit set; but not always.
        // Presence of structure is an indication that Resource could be used as a primary (ie. scanned-out),
        // and naturally used with Present (flip style). (UMD can prevent this- see dxgiddi.h)
        // If pPrimaryDesc absent, blt/ copy style is implied when used with Present.
        DXGI_DDI_PRIMARY_DESC*               pPrimaryDesc;
    
        UINT                                 ByteStride; // 'StructureByteStride' at API
    } D3D11DDIARG_CREATERESOURCE;
    
        // part of user mode Device interface:
        STDMETHOD_( SIZE_T, CalcPrivateResourceSize )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_CREATERESOURCEIN* pCreateResourceIn );
        STDMETHOD( CreateResource )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_CREATERESOURCEIN* pCreateResourceIn,
            D3D11DDI_HRESOURCE hDrvResource );
        STDMETHOD_( void, DestroyResource )( D3D10DDI_HDEVICE hDrvDevice,
            D3D11DDI_HRESOURCE hDrvResource );
    

    5.4.2 Creating a Structured Buffer

    A structured buffer(5.1.3) is created by specifying both a new misc flag and the stride of the structure.

    The only D3D11 Resource type that can have a structure defined is the Buffer type. When the Resource is created at the API, the misc flag D3D11_RESOURCE_MISC_BUFFER_STRUCTURED and a structure stride in bytes must be specified.

    The StructureByteStride can be at most 2048 bytes.

    The D3D11_RESOURCE_MISC_BUFFER_STRUCTURED flag cannot be combined with D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS (described elsewhere).

    The D3D11_RESOURCE_MISC_BUFFER_STRUCTURED flag may be combined with any of the following bind flags:

        D3D11_BIND_SHADER_RESOURCE
        D3D11_BIND_UNORDERED_ACCESS

    The D3D11_RESOURCE_MISC_BUFFER_STRUCTURED flag may NOT be combined with any of the following bind flags:

        D3D11_BIND_VERTEX_BUFFER
        D3D11_BIND_INDEX_BUFFER
        D3D11_BIND_CONSTANT_BUFFER
        D3D11_BIND_STREAM_OUTPUT
        D3D11_BIND_RENDER_TARGET
        D3D11_BIND_DEPTH_STENCIL

    Buffers that define a structure cannot be used with the InputAssembler, either for vertex or index data. Structured buffers also cannot be bound as a stream output target or render target.

    If D3D11_RESOURCE_MISC_BUFFER_STRUCTURED is not set, then the StructureByteStride parameter to the Buffer creation must be 0. If not, the runtime will fail the creation call.

    If D3D11_RESOURCE_MISC_BUFFER_STRUCTURED is set, then StructureByteStride must be non-zero and ByteWidth must be evenly divisible by StructureByteStride. If either condition is not true when creating a structured buffer, the create call will be failed by the runtime.
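
    Gathering these rules together, a minimal sketch of the check a runtime would apply at creation time (illustrative pseudologic only, not actual runtime code):

    // Minimal sketch of the structured buffer creation rules stated above.
    bool ValidStructuredBufferDesc( const D3D11_BUFFER_DESC& bd )
    {
        const bool structured =
            (bd.MiscFlags & D3D11_RESOURCE_MISC_BUFFER_STRUCTURED) != 0;
        if (!structured)
            return bd.StructureByteStride == 0;   // stride must be 0 otherwise
        if (bd.MiscFlags & D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS)
            return false;                         // flags are mutually exclusive
        if (bd.BindFlags & (D3D11_BIND_VERTEX_BUFFER | D3D11_BIND_INDEX_BUFFER |
                            D3D11_BIND_CONSTANT_BUFFER | D3D11_BIND_STREAM_OUTPUT |
                            D3D11_BIND_RENDER_TARGET | D3D11_BIND_DEPTH_STENCIL))
            return false;                         // IA/SO/RT/DS binds disallowed
        return bd.StructureByteStride > 0 &&
               bd.StructureByteStride <= 2048 &&
               (bd.ByteWidth % bd.StructureByteStride) == 0;
    }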

    5.5 Resource Dimensions

    Resource size dimensions (Width, Height, Depth) are always specified in pixel units. Size dimensions are restricted only for subsampled and block compressed formats (see Formats(19.1) section), and are otherwise restricted only to positive integers. Furthermore, the size dimensions of a Resource have no bearing on what functionality is available for the resource (such as filtering support).

    Resource pitches are always expressed in bytes, and indicate the memory delta between the start of pixel rows or array slices, with the only exception being block compressed formats, where the pitch is defined as between 'block' rows instead of pixel rows. Pitch values are restricted only to non-negative integers, intentionally including zero, for which the first row will be replicated to all rows.

    Size dimensions for lower level mipmapped resources are computed by the Direct3D runtime based on the size of the level zero map. These computed dimensions are adjusted upward as necessary to adhere to physical size dimension restrictions for subsampled and block compressed formats - refer to the discussion of physical and virtual dimensions in Block Compressed Formats(19.5) and Sub-Sampled Formats(19.4).


    5.6 Resource Manipulation


    Section Contents

    (back to chapter)

    5.6.1 Mapping

    5.6.1.1 Map Flags
    5.6.1.2 Map() NO_OVERWRITE on Dynamic Buffers used as Shader Resource Views
    5.6.1.3 Map() on DEFAULT Buffers used as SRVs or UAVs
    5.6.2 CopySubresourceRegion
    5.6.2.1 CopySubresourceRegion with Same Source and Dest
    5.6.2.2 CopySubresourceRegion Tileable Copy Flag
    5.6.3 CopyResource
    5.6.4 Staging Surface CPU Read Performance (primarily for ARM CPUs)
    5.6.5 Structured Buffer: CopyResource, CopySubresourceRegion
    5.6.6 Multisample Resolve
    5.6.7 FlushResource
    5.6.8 UpdateSubresourceUP
    5.6.9 UpdateSubresource and CopySubresourceRegion with NO_OVERWRITE or DISCARD


    5.6.1 Mapping

    Mapping/ locking is done at the Subresource level, instead of the Resource level. Mapping means granting CPU access to the Subresource's storage or contents. Typically, the user mode driver must invoke the Lock callback to achieve this operation. The application subsequently relinquishes direct access to mapped Subresources by unmapping them. Only one Map for a given Subresource is allowed (even for non-overlapping regions), and no accelerator operations on a Subresource may be ongoing while a Map is outstanding on that Subresource. However, multiple Subresources of the same Resource may be Mapped at the same time. Each Map method returns a structure that contains a pointer to the storage backing the Resource, and pitch values representing the distances between rows or planes of data, depending on the Subresource dimensionality. The returned pointer always points to the top-left byte (U = 0, V = 0, W = 0) of the mapped Subresource. The layout is similar to that of a multidimensional 'C' array, where the Subresource can be considered to be the following 'C' declaration:

    Pixel_Type Subresource [ W ][ V ][ U ];
    

    with the additional characteristic that the driver is allowed to specify the byte pitch between each row (or block-row for BC formats) and each depth slice.

    When returning a pointer to the mapped resource, the pointer must be 16-byte aligned. This restriction allows applications to perform SSE-optimized operations on the data natively, without realignment or copy (example usages include CPU geometry and texture processing).
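
    A minimal sketch of addressing a mapped Subresource with the returned pitches, assuming a DYNAMIC Texture3D named tex3d with a 4-byte-per-texel format and an ID3D11DeviceContext* named context (names are illustrative):

    // Minimal sketch: write one texel at (U=3, V=5, W=1) of a mapped DYNAMIC
    // Texture3D using the returned pitches. Assumes a 32bpp format.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    context->Map( tex3d, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped );

    BYTE* base  = static_cast<BYTE*>( mapped.pData );
    BYTE* texel = base + 1 * mapped.DepthPitch   // W (slice)
                       + 5 * mapped.RowPitch     // V (row)
                       + 3 * 4;                  // U (texel, 4 bytes each)
    *reinterpret_cast<UINT*>( texel ) = 0xFFFFFFFF;

    context->Unmap( tex3d, 0 );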

    // D3D11.3 Mapping/ Locking:
    // One, more, or none: CPUREAD, CPUWRITE
    // Exclusively one or none: RANGEVALID, AREAVALID, BOXVALID
    // Exclusively one or none: DISCARDRESOURCE
    
    // Bits for D3D11DDIARG_MAPIN::Flags
    #define D3D11DDILOCK_CPUREAD
    #define D3D11DDILOCK_CPUWRITE
    #define D3D11DDILOCK_RANGEVALID
    #define D3D11DDILOCK_AREAVALID
    #define D3D11DDILOCK_BOXVALID
    #define D3D11DDILOCK_DISCARDRESOURCE
    #define D3D11DDILOCK_NOOVERWRITE
    
    typedef struct D3D11DDIARG_MAPIN
    {
        D3D11DDI_HRESOURCE hResource;   // in: resource identifier
        UINT32            Subresource; // in: zero based subresource index
        UINT32            Flags;       // in: flags
    } D3D11DDIARG_MAPIN;
    
    typedef struct D3D11DDIARG_MAPOUT
    {
        void*  pSurfData;  // out: pointer to memory
        SIZE_T Pitch;      // out: pitch of memory
        SIZE_T SlicePitch; // out: slice pitch of memory
    } D3D11DDIARG_MAPOUT;
    
    typedef struct D3D11DDIARG_UNMAPIN
    {
        D3D11DDI_HRESOURCE hResource;   // in: resource identifier
        UINT32            Subresource; // in: zero based subresource index
    } D3D11DDIARG_UNMAPIN;
    
        // part of user mode Device interface:
        STDMETHOD( Map )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_MAPIN* pMapIn, D3D11DDIARG_MAPOUT* pMapOut ) = 0;
        STDMETHOD( Unmap )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_UNMAPIN* pUnmapIn ) = 0;
    

    5.6.1.1 Map Flags

    5.6.1.2 Map() NO_OVERWRITE on Dynamic Buffers used as Shader Resource Views

    Map() allows NO_OVERWRITE for Buffers with DYNAMIC usage and the SHADER_RESOURCE (shader input) bind flag. Before D3D11.1 this was disallowed (though DISCARD was allowed).

    Before the first call with NO_OVERWRITE on a deferred context, a DISCARD must be done on the same context (via Copy*()/Update*()/Map() API flag or Discard*() API). This is not required on immediate contexts if the application knows the GPU is finished with the resource (though discard can be used if not).

    This feature is required to be supported for all D3D10+ hardware with D3D11.1 drivers.

    The background here is that Map() NO_OVERWRITE used to be allowed on Dynamic Index Buffers or Vertex Buffers. Game developers would use this to perform a sliding window of successive buffer updates while rendering follows along. The driver would not have to rename the surface and the GPU did not have to flush rendering while it referenced the Buffer even as the application updated other parts of it.

    Increasingly developers have found reasons to pass the same sort of data into shaders directly (via Shader Resource View) to take advantage of the extra flexibility versus the fixed function semantics of Vertex and Index Buffers at the Input Assembler. As of D3D10, Map() NO_OVERWRITE was not allowed on DYNAMIC Buffers with the Shader Resource bind flag, however. This was simply an oversight, hindering the ability to efficiently feed vertex/index style data directly to shaders.
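
    As a minimal sketch of the pattern (assuming a DYNAMIC Buffer dynBuf created with the SHADER_RESOURCE bind flag, an ID3D11DeviceContext* named context, and illustrative data/sizes):

    // Minimal sketch: sliding-window updates of a DYNAMIC buffer bound via SRV.
    // The first touch uses DISCARD; subsequent appends use NO_OVERWRITE so the
    // GPU need not flush while it reads the earlier regions.
    D3D11_MAPPED_SUBRESOURCE m = {};
    context->Map( dynBuf, 0, D3D11_MAP_WRITE_DISCARD, 0, &m );       // rename/reset
    memcpy( m.pData, firstChunk, firstChunkBytes );
    context->Unmap( dynBuf, 0 );
    // ... issue draws reading the first chunk through the SRV ...

    context->Map( dynBuf, 0, D3D11_MAP_WRITE_NO_OVERWRITE, 0, &m );  // no flush
    memcpy( static_cast<BYTE*>(m.pData) + firstChunkBytes, nextChunk, nextChunkBytes );
    context->Unmap( dynBuf, 0 );
    // ... issue draws reading the second chunk ...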

    5.6.1.3 Map() on DEFAULT Buffers used as SRVs or UAVs

    Map() can be called on Buffers with DEFAULT usage and SHADER_RESOURCE and/or UNORDERED_ACCESS bind flags.

    The Buffer can have MiscFlags BUFFER_ALLOW_RAW_VIEWS, BUFFER_STRUCTURED or nothing.

    Before D3D11.2 this was disallowed. As of D3D11.2, this feature is required to be supported for Feature Level 11.0+ devices with WDDM1.3+ drivers.

    The goal here was to reduce the number of copies required to transfer Buffer data to and from the GPU. Previously, to allow CPU access of the data generated in a DirectCompute computation, an app had to perform an intermediate copy to a STAGING resource. This was due to the fact that only STAGING resources could be directly accessed by the CPU. The need for this copy resulted in a measurable performance hit on bandwidth-intensive DirectCompute scenarios.

    This feature exposed the ability to create Default buffers marked with D3D11_CPU_ACCESS_FLAGs, as long as their creation description matched the specific configuration options described. These restrictions were designed merely to scope down the investigation and development work to fit within budget while enabling the core scenario, not because hardware necessarily has the same degree of constraint.

    5.6.2 CopySubresourceRegion

    This function allows sub-region copying of data from one Subresource to another. No stretch, color key, blend, nor format conversion is performed. However, the format types of each Subresource need not be exactly equal, as a Resource may be Prestructured+Typeless Memory(5.1.5), which is also supported. For example, an R32_FLOAT Texture can be copied to an R32_UINT Texture, as both of these formats are in the same R32_TYPELESS group. Conceptually, the interpreted value of texels changes during this type of copy, but the raw value of the memory happens to be equal. This function also works when both Subresources are Unstructured Memory(5.1.2), except that the regions to copy are specified in raw bytes, rather than pixel or Element units.

    In addition, the Subresources need not be of equal size; but the source and destination regions must fit entirely within the Subresources. The source and destination Subresources must not be the same Subresources.
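
    A minimal sketch of such a copy at the API level (the resource names are assumptions; both textures are in the R32_TYPELESS format group):

    // Minimal sketch: copy a 64x64 region from an R32_FLOAT texture into an
    // R32_UINT texture; the raw bits move unaltered between the two views
    // of the same typeless group.
    D3D11_BOX srcBox = { 0, 0, 0, 64, 64, 1 };       // left, top, front, right, bottom, back
    context->CopySubresourceRegion( dstUintTex,  0,  // dest resource, subresource
                                    16, 16, 0,       // dest offset (x, y, z)
                                    srcFloatTex, 0,  // source resource, subresource
                                    &srcBox );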

    Resources which can be used as Depth/ Stencil cannot participate in this operation as a destination; but they can as a source. Multisampled Resources cannot participate in Copy operations.

    typedef struct D3D11DDIARG_COPYSUBRESOURCEREGIONIN
    {
        D3D11DDI_HRESOURCE hDstResource;   // in: resource identifier
        UINT32             DstSubresource; // in: zero based subresource index
        POINT3D            DstPoints;      // in: Destination Offset
        D3D11DDI_HRESOURCE hSrcResource;   // in: resource identifier
        UINT32             SrcSubresource; // in: zero based subresource index
        CONST D3D11_BOX*   SrcBox;         // in: Source Region
    } D3D11DDIARG_COPYSUBRESOURCEREGIONIN;
    
        // part of user mode Device interface:
        STDMETHOD( CopySubresourceRegion )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_COPYSUBRESOURCEREGIONIN* pCopySubresourceRegionIn ) = 0;
    

    5.6.2.1 CopySubresourceRegion with Same Source and Dest

    CopySubresourceRegion*() allows the source and dest to be the same resource, with D3D11.1 drivers. The driver must handle overlapping copies.

    This feature is required to be supported for all D3D10+ hardware with D3D11.1 runtime+drivers. When the application uses feature level 9.x all drivers support this with the D3D11.1 runtime.

    5.6.2.2 CopySubresourceRegion Tileable Copy Flag

    CopySubresourceRegion*() allows a new TILEABLE flag when the source is a currently bound RenderTarget (the flag is ignored otherwise). This is intended for tile / deferred rendering GPUs (it has no impact on the copy for non-tiled rendering GPUs). The flag indicates that if the GPU happens to be processing only a given tile of a RenderTarget at a time (where the RenderTarget is the source in the copy), the GPU can break the copy call to occur per-tile along with the surrounding rendering calls batched for the scene, without having to flush the scene for all tiles.

    The application is guaranteeing that future access to the destination of the copy will only be used for 1:1 cycling of that data back into the same pixel location of the affected RenderTarget (which remains bound). Said another way, the application is guaranteeing that when a tiling GPU replays batched rendering commands to produce any given tile, there will be no visible effect (e.g. to commands earlier in the batch) of the copy having already occurred for previously processed tiles.

    The source and dest don't have to be the same size resource; this flag is relevant to just the region being copied.

    When the application is finished using the target of the TILEABLE copy for recirculating back to the original surface, DiscardResource() should be called if the contents are no longer needed (but this is not strictly required). For some implementations, knowing the end of life of the data in the scratch surface could allow the entire copy to be optimized away into leaving the data in fast tile memory and never having to write it out to GPU memory.

    If an application violates the 1:1 property when using the TILEABLE flag on CopySubresourceRegion, such as reading into a different pixel, or into a shader stage other than the Pixel Shader in the second pass, the data being read is undefined (it will have been generated by an unknown rendering pass by the application, or be uninitialized).

    If the RenderTarget gets unbound, any copies from it that happened with the TILEABLE flag while bound lose the TILEABLE property after the RenderTarget unbinding.

    This feature is available for all D3D9+ hardware with D3D11.1 drivers (D3D9 portion of the DDI for D3D9 hardware and both D3D9 and D3D11.1 portions of the DDI for D3D10+ hardware).

    This feature will be exposed only to customers of Direct3D within the Windows OS, at least initially, given the narrowly focused application.

    An example of a valid scenario (Direct2D will do something similar to this, and likely other Windows components):

        1. Render to a RenderTarget.
        2. CopySubresourceRegion with the TILEABLE flag from a region of the (still bound) RenderTarget to a scratch surface.
        3. Continue rendering to the RenderTarget, with the Pixel Shader reading the scratch surface only to cycle its data back into the same pixel locations it was copied from.
        4. When finished recirculating the data, call DiscardResource() on the scratch surface.

    The example does not work if additional copies are inserted from surface to surface (the length of the cycle can't be extended) - doing so just means the TILEABLE flag loses its value and the GPU will likely have to flush the scene. Behavior should be correct here but performance gains may be lost. In general just because the TILEABLE flag is used on a Copy doesn't mean there will not be a mid-scene flush - that could happen for other reasons, typically changing of RenderTargets. The tileable flag just means there is one less trigger for mid-scene flushes.

    5.6.3 CopyResource

    This function allows copying of an entire Resource, assuming the Resources are identical types and dimensions. No stretch, color key, blend, nor format conversion is performed. However, the format types of each Subresource need not be exactly equal, as a Resource may be Prestructured+Typeless Memory(5.1.5), which is also supported. For example, an R32_FLOAT Texture can be copied to an R32_UINT Texture, as both of these formats are in the same R32_TYPELESS group. Conceptually, the interpreted value of texels changes during this type of copy, but the raw value of the memory happens to be equal. This function also works when both Resources are Unstructured Memory(5.1.2).

    Resources which can be used as Depth/ Stencil cannot participate in this operation as a destination; but they can as a source. Multisampled Resources cannot participate in Copy operations. This operation is also central to performant readback and upload scenarios.(5.3.2)

    typedef struct D3D11DDIARG_COPYRESOURCEIN
    {
        D3D11DDI_HRESOURCE hDstResource; // in: resource identifier
        D3D11DDI_HRESOURCE hSrcResource; // in: resource identifier
    } D3D11DDIARG_COPYRESOURCEIN;
    
        // part of user mode Device interface:
        STDMETHOD( CopyResource )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_COPYRESOURCEIN* pCopyResourceIn ) = 0;
    

    5.6.4 Staging Surface CPU Read Performance (primarily for ARM CPUs)

    On the ARM CPU, cache coherency isn’t provided when the GPU writes to system memory, so a GPU driver would normally be tempted to put a staging (D3D CPU memory) surface in uncached memory (which is slow for CPU access) to avoid incorrect values being read from the cache. However, the Win8 Video Memory Manager will manually flush the CPU cache on ARM when data has been copied from the GPU to a staging surface – so GPU drivers can safely use cacheable memory for STAGING surfaces (yielding good performance on CPU reads). VidMM will also flush CPU caches for the opposite case as well - before the GPU reads from a STAGING surface.

    At the D3D11.1 DDI, when a STAGING surface is created, the CPU_ACCESS flags (READ and/or WRITE) are mapped directly down through the DDI, so it is obvious to drivers when the cacheable memory choice should be made (when WRITE is not set). For the D3D9 DDI (which all drivers for all hardware feature levels must implement), the mapping from D3D11's CPU_ACCESS flags to the D3D9 DDI’s is described in the separate API/DDI spec - see PFND3DDDI_CREATERESOURCE - the relevant case is SYSTEMMEMORY surfaces that don't have the WriteOnly flag set at the D3D9 DDI.

    A note for User Mode drivers: The driver must not cache Map operations on surfaces that rely on the software enforced coherency described above (i.e. the surface is cacheable but mapped into an aperture segment which doesn’t support CacheCoherency). The driver must explicitly call LockCb and UnlockCb at every Map for such surfaces to give VidMm an opportunity to apply the proper memory barrier. Failing to do so will result in the surface getting corrupted over time.

    5.6.5 Structured Buffer: CopyResource, CopySubresourceRegion

    CopyResource and CopySubresourceRegion allow either or both the source and destination to be structured buffers. It is possible to copy from linear to structured, structured to linear, and structured to structured. If copying between structured buffers, the strides must be the same or the runtime will fail the copy operation. If the region to copy is not specified as complete structures, then the runtime will fail the copy operation.

    When either the source or destination is linear and the other is structured, it is up to the driver to rearrange the layout if necessary. If structured buffers are stored linearly, then the copy operation is a straightforward copy. If not stored linearly, then any tiling or other reorganization must occur as part of the copy operation.

    5.6.6 Multisample Resolve

    Only multisampled render targets are able to be resolved to a single-sampled resource. Naturally, the source must be a multisampled render target, while the destination must be a single-sampled resource restricted such that it resides in video memory. For example, the destination cannot be a dynamic or system-memory friendly Resource; thus the destination Resource must be USAGE_DEFAULT. The algorithm to resolve multiple samples to one pixel is implementation dependent. Resolve shares some of the restrictions of Copy: both Resources must be the same type (ie. Texture2D), and no stretching is performed. Only a whole Subresource can be resolved, so both Subresources must be the same dimensions. Format conversion is not performed for ResolveSubresource either. However, due to typeless Resources, there is an interesting interaction with each Resource's Format. If each Resource is prestructured+typed, then both Resources must have the same Format, and that must match the passed in ResolveFormat (ie. all R32_FLOAT). If one Resource is prestructured+typeless, then the prestructured+typed Resource's format must be compatible with the typeless format, and the ResolveFormat must match the prestructured+typed format (ie. Src: R32_TYPELESS, Dst & ResolveFormat: R32_FLOAT). If both Resources are prestructured+typeless, then they must be equal formats, and the ResolveFormat may be any format compatible with the typeless format that supports resolve (ie. Src & Dst: R32_TYPELESS -> ResolveFormat must be R32_FLOAT).

    Further discussion on format interpretations and Multisample Resolve can be found in the Multisample Format Support(19.2) section.

    Multisample resolve is performed in linear space, so conversion to linear for sRGB formats is performed prior to any arithmetic operations on the resource data, similar to the requirement for conversion to linear prior to filtering and blending arithmetic operations.
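
    At the API level the operation is a single call; a minimal sketch (the resource names are assumptions for illustration):

    // Minimal sketch: resolve subresource 0 of a multisampled render target
    // into a single-sampled USAGE_DEFAULT texture of the same dimensions.
    context->ResolveSubresource( resolvedTex, 0,              // dest, subresource
                                 msaaTex, 0,                  // source, subresource
                                 DXGI_FORMAT_R8G8B8A8_UNORM );// resolve format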

    typedef struct D3D11DDIARG_RESOLVESUBRESOURCEIN
    {
        D3D11DDI_HRESOURCE hDstResource; // in: resource identifier
        UINT DstSubresource; // in: subresource index
        D3D11DDI_HRESOURCE hSrcResource; // in: resource identifier
        UINT SrcSubresource; // in: subresource index
        DXGI_FORMAT ResolveFormat; // in: resolve format
    } D3D11DDIARG_RESOLVESUBRESOURCEIN;
    
        // part of user mode Device interface:
        STDMETHOD( ResolveSubresource )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_RESOLVESUBRESOURCEIN* pResolveSubresourceIn ) = 0;
    

    5.6.7 FlushResource

    This operation identifies a Read-after-Write Hazard on a Resource granularity throughout the usage of a Device Context. This operation will be sent to the driver immediately before the Resource is used as an input in the graphics pipeline, as this is when the hazard is detected. For example, as a Render Target/ Texture transitions from a Render Target to a Texture, FlushResource will identify this transition immediately before the Resource is set as a Texture. FlushResource will identify the Resource, as a whole, and not the individual Subresources involved. It is expected that this operation detects when GPU caches need to be flushed.

    When the pipeline is configured so that the Subresources of a Resource being written to do not overlap the Subresources being read from at the same time, FlushResource operations will not be sent for such a Resource. So, the driver should not rely on notifications for this type of condition, as there is not really a Read-after-Write Hazard.

    Additionally, FlushResource should not be expected to be used to identify any hazards related to shared Resources: neither same-process cross-Device Context Resources nor cross-process Resources. Whenever a Device Context is swapped for another Device Context, GPU caches should be flushed, as needed, to maintain correct behavior. The only hazards FlushResource exposes are within the same device context.

        // part of user mode Device interface:
        STDMETHOD( FlushResource )( D3D10DDI_HDEVICE hDrvDevice,
            D3D11DDI_HRESOURCE hDrvResource ) = 0;
    

    5.6.8 UpdateSubresourceUP

    If a Subresource was created with flags preventing the CPU from mapping/ locking and writing to the Resource, the Subresource may still be able to be modified with UpdateSubresourceUP, as these concepts are mutually exclusive.

    UpdateSubresourceUP may not be used when the Resource was created with flags allowing the CPU to map/ lock the Resource. It also may not be used with Resources that can be used as Depth/ Stencil, nor for multisampled Resources.

    Partial updates of ConstantBuffers are disallowed, so when modifying ConstantBuffers with UpdateSubresourceUP, the update box will always be NULL.

    UpdateSubresource works with structured buffers as a destination. The source data is interpreted as an array of structures of the destination’s stride. If necessary, any conversion of the data to a different layout must happen during the update process. It is only valid to update ranges of complete structures. If the bounds of the region being updated are not a range of complete structures, the runtime will fail the update operation.

    typedef struct D3D11DDIARG_UPDATESUBRESOURCEUPIN
    {
        D3D11DDI_HRESOURCE hDstResource;   // in: resource identifier
        UINT32            DstSubresource; // in: zero based subresource index
        CONST D3D11_BOX*    pDstBox;        // in: update box
        CONST VOID*       pSrcUPData;     // in: data pointer
        SIZE_T            SrcPitch;       // in: data pitch
        SIZE_T            SrcSlicePitch;  // in: data slice pitch
    } D3D11DDIARG_UPDATESUBRESOURCEUPIN;
    
        // part of user mode Device interface:
        STDMETHOD( UpdateSubresourceUP )( D3D10DDI_HDEVICE hDrvDevice,
            CONST D3D11DDIARG_UPDATESUBRESOURCEUPIN* pUpdateSubresourceUPIn ) = 0;
    

    5.6.9 UpdateSubresource and CopySubresourceRegion with NO_OVERWRITE or DISCARD

    This is a new variant of the UpdateSubresource() and CopySubresourceRegion() APIs (which both update a portion of a GPU surface) for D3D11.1. The addition is a Flags field where NO_OVERWRITE or DISCARD can be specified. A separate new feature that also affects UpdateSubresource is that it now allows overlapping copies.

        void UpdateSubresource1(
            ID3D11Resource* pDstResource,
            UINT DstSubresource,
            const D3D11_BOX* pDstBox,
            const void* pSrcData,
            UINT SrcRowPitch,
        UINT SrcDepthPitch,
            UINT CopyFlags ); // new CopyFlags parameter where D3D11_COPY_NO_OVERWRITE,
                              // D3D11_COPY_DISCARD, or nothing can be specified.
    
    void CopySubresourceRegion1(
            ID3D11Resource* pDstResource,
            UINT DstSubresource,
            UINT DstX,
            UINT DstY,
            UINT DstZ,
            ID3D11Resource* pSrcResource,
            UINT SrcSubresource,
            const D3D11_BOX* pSrcBox,
            UINT CopyFlags ); // new CopyFlags parameter where D3D11_COPY_NO_OVERWRITE,
                              // D3D11_COPY_DISCARD, or nothing can be specified.
    

    Specifying NO_OVERWRITE means that the system can assume that existing references to the surface that may be in flight on the GPU will not be affected by the update, so the copy can proceed immediately (avoiding either a batch flush or the system maintaining multiple copies of the resource behind the scenes).

    DISCARD means that the system may discard the entire contents of the destination memory outside the region being updated.

    Before the first call with NO_OVERWRITE on a deferred context, a DISCARD must be done on the same context (via Copy*()/Update*()/Map() API flag or Discard*() API). This is not required on immediate contexts if the application knows the GPU is finished with the resource (though discard can be used if not).
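
    A minimal sketch of this pattern with UpdateSubresource1 (assuming an ID3D11DeviceContext1* named context1, a glyph-cache style texture named cacheTex, and illustrative data/regions):

    // Minimal sketch: per-glyph updates into a cache texture without flushing.
    // The first touch uses DISCARD; later non-overlapping updates use
    // NO_OVERWRITE so the GPU need not flush while it reads earlier regions.
    D3D11_BOX box1 = { 0, 0, 0, 32, 32, 1 };
    context1->UpdateSubresource1( cacheTex, 0, &box1, glyphBits, glyphPitch, 0,
                                  D3D11_COPY_DISCARD );       // first write

    D3D11_BOX box2 = { 32, 0, 0, 64, 32, 1 };                 // disjoint region
    context1->UpdateSubresource1( cacheTex, 0, &box2, glyphBits2, glyphPitch, 0,
                                  D3D11_COPY_NO_OVERWRITE );  // no flush needed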

    Tile based deferred rendering (TBDR) GPUs might particularly benefit from this. They are always running multiple passes over the same command buffer, so any resource that is updated in the middle of rendering has to be maintained in the driver in a before and after state, or the tiling pass has to end before the resource update is performed (which is a very expensive tile flush operation).

    These APIs will drive not only the D3D11.1 DDI but also D3D9 DDIs. So new drivers for any DX9+ hardware would have to support/understand revised BLT, BUFBLT, VOLBLT and TEXBLT DDIs adding the flags discussed here.

    These are also required to be supported for all D3D10+ hardware with D3D11.1 drivers.

    The implementation of system to video blts is critical for good performance in Direct2D text rendering. Drivers that expose the cap bit indicating that they are a tile-based renderer will encounter the following situation during Direct2D text rendering: a blt from a system memory surface (which the CPU updates with new glyph data) into a video memory surface, issued mid-scene with the NoOverWrite or Discard flag.

    When drivers encounter this scenario, they should implement the copy with the CPU synchronously. The NoOverWrite or Discard flag specified in the blt call can be used by the driver to map the destination surface for CPU access. These flags also enable drivers to implement this blt without a mid-scene flush. Drivers that implement this blt asynchronously (with either the CPU or the GPU) will see slowdowns when Direct2D attempts to map the system memory surface in the future.

    Drivers on immediate-mode GPUs are free to implement system to video blts asynchronously.

    5.7 Resource Discard

    DiscardResource() and DiscardView() API/DDIs (the latter allowing rects to be specified) allow applications to specify that the contents of a resource (or the subset of it that is in a View) may be discarded. This is reflected in both the D3D11.1 and D3D9 DDIs. The D3D9 DDI does not have Views, but does support limited subsetting of resources, so that is reflected in the new D3D9 Discard DDI (documented elsewhere).

    On some GPUs with tile based deferred rendering (TBDR) architectures, binding RenderTargets that already have contents in them (from previous rendering) incurs a cost for having to copy the RenderTarget contents back into tile memory for rendering. If the application knows it is going to cover the entire surface anyway with new data, the copy is not needed.

    On TBDRs a copy from tile memory back out can sometimes also be avoided. For example, if a Multisampled RTV is Resolve()'d and then Discard()ed, the implementation may be able to resolve as each tile is finished without having to write out the full multisampled tile data. Specifying Discard() right away, rather than waiting to specify discard on binding the resource later, requires less look-ahead for the driver to know what it can do.

    Multi-GPU systems can also benefit from discard semantics, such as in cases where separate frames are rendered on different GPUs, avoiding the need for cross-GPU data copies.


    5.8 Per-Resource Mipmap Clamping


    Section Contents

    (back to chapter)

    5.8.1 Intro
    5.8.2 API Access
    5.8.3 Mipmap Number Space
    5.8.4 Fractional Clamping
    5.8.5 Empty-Set Cases
    5.8.6 Per-Resource Clamp Examples

    5.8.6.1 Case 1: Per-resource Clamp falls within SRV and Sampler Clamp
    5.8.6.2 Case 2: Per-Resource Clamp falls within SRV, but outside Sampler clamp
    5.8.6.3 Case 3: Per-Resource Clamp falls outside SRV
    5.8.7 Effects Outside ShaderResourceViews


    5.8.1 Intro

    D3D11 includes a way for applications to prevent some of the mipmaps in a resource from being accessible via the 3D pipeline (by clamping the mipmaps). This mechanism operates per-resource, as opposed to per-sampler(7.18.2) or per-ShaderResourceView, allowing applications a convenient way to globally control the GPU memory footprint that is referenced at any point. Drivers can easily take advantage of these per-resource clamps since they know that clamped off miplevels do not have to be resident in GPU memory.

    5.8.2 API Access

    Each resource (such as a texture2D) that an application creates will have a method on its interface that queues a D3D command setting a float32 scalar global MinLOD clamp for all Shader Resource Views of that resource. The fact that the command is queued means it does not affect the behavior of anything ahead of it in the queue.

    Recall that lower LOD values define the more detailed mipmaps in a mipmap chain, so applying a MinLOD clamp has the effect of clamping off the most detailed miplevel(s).

    The per-resource global MinLOD clamp applies to any reference to the resource from a shader via a Shader Resource View, such as using sample* or ld* instructions. Note that Sampler(7.18.2) objects already contain a fixed MinLOD and MaxLOD clamp, honored by instructions that take a Sampler as an operand such as sample*. The per-resource MinLOD clamp has the same effect as the Sampler MinLOD clamp (both clamps are applied), except each has a different number space for identifying mipmaps.

    5.8.3 Mipmap Number Space

    The per-resource MinLOD clamp considers the most detailed mipmap on the resource as LOD 0, so specifying a MinLOD clamp of 1 causes miplevel 0 on the resource to be ignored. On the other hand, the Sampler’s MinLOD clamp defines the most detailed mipmap in the current Shader Resource View as LOD 0. So on a Shader Resource View that, for example, limits a mipmap chain to exclude the most detailed 3 mips from a resource, setting the Sampler MinLOD to 1 causes miplevel [3] (the fourth mip) in the resource to be ignored.

    5.8.4 Fractional Clamping

    The per-resource MinLOD clamp can be fractional (like the Sampler(7.18.2) MinLOD clamp) – this is useful with linear mipmap filtering. For example, suppose the per-resource MinLOD clamp is 1.1, and the current Shader Resource View is the entire mipchain. Texture filters would behave as if the most detailed mipmap available is a blend of 90% of mipmap [1] and 10% of mipmap [2]. Both mipmap [1] and [2] would have to be resident on the GPU. A way to make use of the fractions is to start with a high MinLOD clamp (limiting the memory footprint enough to prevent stalling on texture upload to the GPU), and gradually lowering the MinLOD clamp on the resource over time, allowing the driver/hardware more time to make all of the resource resident. Visually there would be no popping, as the influence of more detailed mipmaps is blended in.
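
    In the API as shipped, this clamp is set through the device context (ID3D11DeviceContext2::SetResourceMinLOD). A minimal sketch of the gradual-streaming pattern described above (the step size and policy are illustrative assumptions):

    // Minimal sketch: stream a texture in without popping by lowering the
    // per-resource MinLOD clamp a little each frame as mips become resident.
    float g_minLod = 4.0f;               // start with only the small mips visible

    void OnFrame( ID3D11DeviceContext2* context2, ID3D11Resource* tex )
    {
        if (g_minLod > 0.0f)
        {
            g_minLod -= 0.1f;            // fractional steps blend detail in smoothly
            if (g_minLod < 0.0f) g_minLod = 0.0f;
            context2->SetResourceMinLOD( tex, g_minLod );
        }
    }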

    A fractional per-resource MinLOD clamp basically requires the floor of the MinLOD miplevel and the less detailed miplevels to be resident. In the example above with a per-resource MinLOD clamp of 1.1, if a ld instruction requests data from miplevel [1], it will be resident.

    As another example, consider the same Shader Resource View with a full mipchain, but a MinLOD clamp of 0.1. The gather4(22.4.2) instruction is defined to operate on mip 0 in the view only (otherwise an out of bounds result is returned). But since the clamp of 0.1 requires mip 0 to be present, gather4 will fetch from mip 0.

    5.8.5 Empty-Set Cases

    Suppose a ShaderResourceView on a resource is defined which limits the miplevels visible in the resource. Now suppose a per-resource MinLOD clamp is set such that the intersection of the remaining active miplevels after the clamp, with the miplevels used in a ShaderResourceView, is empty – e.g. using a ShaderResourceView of mipmaps 0..3 on a resource along with a resource MinLOD clamp of 5. The result of fetching from the ShaderResourceView with such an empty intersection with the per-resource clamp is the defined out-of-bounds access result. That is, 0 is returned for all non-missing components of the format of the resource, and the default is provided for missing components. The lod(22.5.6) instruction returns 0 for the clamped LOD in this empty-set case.

    If a texture has 6 mip levels (0..5) and the MinLOD clamp is set to any value past the least detailed mip in the view (e.g. 5.1), the out of bounds behavior applies. This is an exception to the rule that the floor of the MinLOD clamp is required to be present.

    Shader ld*(22.4.6) instructions, which do not perform filtering, and which access miplevels directly, also honor the per-resource MinLOD clamp. This is unlike the MinLOD clamp in Sampler state, since ld* instructions do not use samplers. The previous section has an example illustrating how ld behaves with a fractional clamp.

    If sample*(22.4.15) instructions that explicitly provide a miplevel to fetch from, such as sample_l(22.4.18), request a miplevel that is clamped off by a per-resource MinLOD clamp (where the per-resource clamp still falls within the View), the result of the fetch is the same as what happens with sampler clamping; that is, the most detailed available mip (after both the sampler and per-resource MinLOD clamps) is used.

    When sampling using a Sampler(7.18.2) configured to use BorderColor, accessing the border region of a mipmap that has been clamped off due to MinLOD clamp, the result is the out of bounds behavior (as opposed to returning the border color).

    5.8.6 Per-Resource Clamp Examples

    5.8.6.1 Case 1: Per-resource Clamp falls within SRV and Sampler Clamp

    Initial Conditions:

    Resource: 8 miplevels [0..7]
    Shader Resource View: [1..6] (so mip 0 in the view is mip 1 on the resource.  In View space this is [0..5])
    Sampler MinLOD = 1.2, MaxLOD = 4 (this is in the View mip number space)
    Sampler filter mode: MIN_MAG_MIP_LINEAR
    Per-Resource MinLOD clamp = 3.5 (this is in the Resource mip number space)
    

    Some results:

    5.8.6.2 Case 2: Per-Resource Clamp falls within SRV, but outside Sampler clamp

    Initial Conditions:

    Resource: 8 miplevels [0..7]
    Shader Resource View: [1..6] (so mip 0 in the view is mip 1 on the resource.  In view space this is [0..5])
    Sampler MinLOD = 1.2, MaxLOD = 4 (this is in the View mip number space)
    Sampler filter mode: MIN_MAG_MIP_LINEAR
    Per-Resource MinLOD clamp = 5.5 (this is in the Resource mip number space)
    

    Some results:

    5.8.6.3 Case 3: Per-Resource Clamp falls outside SRV

    Initial Conditions:

    Resource: 8 miplevels [0..7]
    Shader Resource View: [1..6] (so mip 0 in the view is mip 1 on the resource.  In view space this is [0..5])
    Sampler MinLOD = 1.2, MaxLOD = 4 (this is in the View mip number space)
    Sampler filter mode: MIN_MAG_MIP_LINEAR
    Per-Resource MinLOD clamp = 6.5 (this is in the Resource mip number space)
    

    Some results:

    5.8.7 Effects Outside ShaderResourceViews

    Per-resource MinLOD clamps only affect the behavior of ShaderResourceView accesses from shader code – such as the sample* and ld* instructions discussed so far.

    Other operations on the resource are unaffected by per-resource MinLOD clamps, including reading and/or writing via RenderTargetViews, DepthStencilViews, or resource manipulation APIs such as CopySubresourceRegion, UpdateSubresource or GenerateMips. Any such reference to the contents of a resource, i.e. NOT through a ShaderResourceView, requires the system to make appropriate memory resident for the requested operation to proceed as expected, unaffected by per-resource MinLOD clamping.

    The behavior of the resinfo instruction with respect to the per-resource MinLOD clamp is defined within the instruction's definition(22.4.14).


    5.9 Tiled Resources


    Section Contents

    (back to chapter)

    5.9.1 Overview

    5.9.1.1 Purpose
    5.9.1.2 Background and Motivation
    5.9.2 Creating Tiled Resources
    5.9.2.1 Creating the Resource
    5.9.2.2 Mappings are into a Tile Pool
    5.9.2.2.1 Tile Pool Creation
    5.9.2.2.2 Tile Pool Resizing
    5.9.2.2.3 Hazard Tracking vs. Tile Pool Resources
    5.9.2.3 Tiled Resource Creation Parameters
    5.9.2.3.1 Address Space Available for Tiled Resources
    5.9.2.4 Tile Pool Creation Parameters
    5.9.2.5 Tiled Resource Cross Process / Device Sharing
    5.9.2.5.1 Stencil Formats Not Supported with Tiled Resources
    5.9.2.6 Operations Available on Tiled Resource
    5.9.2.7 Operations Available on Tile Pools
    5.9.2.8 How a Tiled Resource's Area is Tiled
    5.9.2.8.1 Texture1D[Array] Subresource Tiling - Designed But Not Supported
    5.9.2.8.2 Texture2D[Array] Subresource Tiling
    5.9.2.8.3 Texture3D Subresource Tiling
    5.9.2.8.4 Buffer Tiling
    5.9.2.8.5 Mipmap Packing
    5.9.3 Tiled Resource APIs
    5.9.3.1 Assigning Tiles from a Tile Pool to a Resource
    5.9.3.2 Querying Resource Tiling and Support
    5.9.3.3 Copying Tiled Data
    5.9.3.3.1 Note on GenerateMips()
    5.9.3.4 Resize Tile Pool
    5.9.3.5 Tiled Resource Barrier
    5.9.4 Pipeline Access to Tiled Resources
    5.9.4.1 SRV Behavior with Non-Mapped Tiles
    5.9.4.2 UAV Behavior with Non-Mapped Tiles
    5.9.4.3 Rasterizer Behavior with Non-Mapped Tiles
    5.9.4.3.1 DepthStencilView
    5.9.4.3.2 RenderTargetView
    5.9.4.4 Tile Access Limitations With Duplicate Mappings
    5.9.4.4.1 Copying Tiled Resources With Overlapping Source and Dest
    5.9.4.4.2 Copying To Tiled Resource with Duplicated Tiles in Dest Area
    5.9.4.4.3 UAV Accesses to Duplicate Tiles Mappings
    5.9.4.4.4 Rendering After Tile Mapping Changes Or Content Updates from Outside Mappings
    5.9.4.4.5 Rendering To Tiles Shared Outside Render Area
    5.9.4.4.6 Rendering To Tiles Shared Within Render Area
    5.9.4.4.7 Data Compatibility Across Tiled Resources Sharing Tiles
    5.9.4.5 Tiled Resources Texture Sampling Features
    5.9.4.5.1 Overview
    5.9.4.5.2 Shader Feedback About Mapped Areas
    5.9.4.5.3 Fully Mapped Check
    5.9.4.5.4 Per-sample MinLOD Clamp
    5.9.4.5.5 Shader Instructions
    5.9.4.5.6 Min/Max Reduction Filtering
    5.9.4.6 HLSL Tiled Resources Exposure
    5.9.5 Tiled Resource DDIs
    5.9.5.1 Resource Creation DDI: D3D11DDIARG_CREATERESOURCE
    5.9.5.2 Texture Filter Descriptor: D3D10_DDI_FILTER
    5.9.5.3 Structs used by Tiled Resource DDIs
    5.9.5.4 DDI Functions
    5.9.6 Quilted Textures - For future consideration only
    5.9.6.1 Sampling Behavior for Quilted Textures
    5.9.7 Tiled Resources Features Tiers
    5.9.7.1 Tier 1
    5.9.7.1.1 Limitations affecting Tier 1 only
    5.9.7.2 Tier 2
    5.9.7.3 Some Future Tier Possibilities
    5.9.7.4 Capability Exposure
    5.9.7.4.1 Tiled Resources Caps
    5.9.7.4.2 Multisampling Caps


    5.9.1 Overview


    5.9.1.1 Purpose

    This spec is for "Tiled Resources" in D3D. Other terms that have been used for the same concept are "Sparse Textures" and "Partially Resident Textures".

    This document outlines what might be expected of D3D implementations if this hypothetical feature were included in a future version of D3D.


    5.9.1.2 Background and Motivation

    Recall that all D3D memory allocations are managed at subresource granularity (in a system without Tiled Resource support). For a Buffer, the entire Buffer is the subresource. For a Texture, each mip level is a subresource (at a given array slice if it is a Texture Array). The graphics system (OS, driver, hardware) only exposes the ability to manage the mapping of allocations at this subresource granularity. "Mapping", in the context of Tiled Resources in this spec, refers to making data visible to the GPU.

    Suppose an application knows that a particular rendering operation only needs to access a small portion of an image mipmap chain (perhaps not even the full area of a given mipmap). Ideally the system could be told about this and only bother to ensure that the needed memory is mapped on the GPU without paging in too much. In reality, the system can only be informed about what memory needs to be mapped on the GPU at subresource granularity (i.e. a range of full mipmap levels that could be accessed). There is no demand faulting in the graphics system either, so potentially a lot of excess GPU memory needs to be used to make full subresources mapped before a rendering command that references any part of the memory is executed. This is just one issue that makes the use of large memory allocations difficult in D3D.

    D3D11 supports Texture2D surfaces with up to 16384 pixels on a given side. An image that is 16384 wide by 16384 tall at 4 bytes per pixel would consume 1GB of video memory (and adding a full mipmap chain would grow that by about a third). In practice it is unlikely/rare that all 1GB would need to be referenced in a single rendering operation.

    Some game developers are now modeling terrain surfaces as large as 128K by 128K. The way they get this to work on existing GPUs is to break the surface into tiles that are small enough for hardware to handle. The application must figure out which tiles might be needed and load them into a cache of textures on the GPU - a software paging system. A significant downside to this approach comes from the hardware not knowing anything about the paging that is going on: When a part of an image needs to be shown on screen that straddles tiles, the hardware does not know how to perform fixed function (i.e. efficient) filtering across tiles. This means the application managing its own software tiling must resort to manual texture filtering in shader code (which becomes very expensive if a good quality anisotropic filter is desired) and/or waste memory authoring gutters around tiles that contain data from neighboring tiles so that fixed function hardware filtering can continue to provide some assistance.

    If a Tiled representation of surface allocations could be a first-class feature in the graphics system, the application could tell the hardware which tiles to make available. So (a) less GPU memory is wasted storing regions of surfaces that the application knows will not be accessed, and (b) the hardware can understand how to filter across adjacent tiles, alleviating some of the pain experienced by developers doing software tiling today.

    But to provide a complete solution, something must be done to deal with the fact that, independent of whether tiling within a surface is supported, the maximum surface dimension is currently 16384 - nowhere near the 128K+ that applications already want. Just requiring the hardware to support larger texture sizes is one approach; however, there are significant costs and/or tradeoffs to going this route. D3D11's texture filter path and rendering path are already saturated in terms of precision in supporting 16K textures with the other requirements, such as supporting viewport extents falling off the surface during rendering, or supporting texture wrapping off the surface edge during filtering. A possibility is to define a tradeoff such that as the texture size increases beyond 16K, functionality/precision is given up in some manner. Even with this concession, however, additional hardware costs may be required in terms of addressing capability throughout the hardware system to go to larger texture sizes.

    One issue that comes into play as textures get very large is that single precision floating point texture coordinates (and the associated interpolators to support rasterization) run out of precision to specify locations on the surface accurately. Jittery texture filtering would ensue. One expensive option would be to require double precision interpolator support, though that could be overkill given a reasonable alternative - discussed later.

    Regardless of whether the supported texture size is increased above 16K, if whatever limit is settled on is not orders of magnitude larger, the question remains: What if the application wants a surface even larger than whatever limit is in place? A reasonable approach could be to "Quilt" these large textures manually, independent of the Tiling within each texture. This document covers an approach along these lines. This might also mitigate a lack of double precision attribute interpolation.

    The reason one of the alternate names for this feature is "Sparse Texture" is that "Sparse" conveys both the Tiled nature of the resources as well as perhaps the primary reason for Tiling them - that not all of the tiles are expected to be mapped at once. In fact, it is conceivable that an application could intentionally author a Sparse/Tiled Resource in which some regions+mips of the resource have no data at all. So the content itself could be sparse, and the mapping of the content in GPU memory at a given time would be a subset of that (even more sparse).

    Another scenario that could be served by Tiled Resources is enabling multiple Resources of different dimensions/formats to share the same memory. Sometimes applications have exclusive sets of resources that are known not to be used at the same time, or resources that are created only for very brief use and then destroyed, followed by creation of other resources. A form of generality that can fall out of "Tiled Resources" is that it is possible to allow the user to point multiple different resources at the same (overlapping) memory. In other words, the creation and destruction of "resources" (which define a dimension/format etc.) can be decoupled from the management of the memory underlying the resources from the application's point of view.

    The rest of this section dives into the details required to define "Tiled Resources" in the context of D3D.


    5.9.2 Creating Tiled Resources


    5.9.2.1 Creating the Resource

    To create a Tiled Resource, the flag D3D11_RESOURCE_MISC_TILED has to be specified as a MiscFlag on the Create* call. Restrictions on when this flag can be used are described later.

    Whereas a non-Tiled Resource's storage is allocated in the system when the resource is created (e.g. CreateTexture2D API call), for a Tiled Resource, the storage for the Resource contents is not allocated. Instead, when a Tiled Resource is created at the API, the system makes an address space reservation for the tiled surface's area only, and then allows the mapping of the tiles to be controlled by the application. The "mapping" of a tile is simply the physical location in memory that a logical tile in a resource points to (or NULL for an unmapped tile). This is not to be confused with the notion of mapping a D3D resource for CPU access, which despite using the same name is completely independent. The developer will be able to define and change the mapping of each tile individually as needed, knowing that not all tiles for a surface need to be mapped at any given time, thereby making effective use of the amount of memory available.
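
    As a non-normative illustration (the surface size and format here are arbitrary), creating a Tiled Resource differs from creating a normal one only by the misc flag and the absence of initial data:

    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width = 16384;
    desc.Height = 16384;
    desc.MipLevels = 1;
    desc.ArraySize = 1;
    desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Usage = D3D11_USAGE_DEFAULT;            // only DEFAULT is allowed for Tiled Resources
    desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags = D3D11_RESOURCE_MISC_TILED;
    
    // Initial data must be NULL - the resource starts with no memory backing.
    ID3D11Texture2D* pTiledTexture = NULL;
    HRESULT hr = pDevice->CreateTexture2D(&desc, NULL, &pTiledTexture);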


    5.9.2.2 Mappings are into a Tile Pool

    When the flag D3D11_RESOURCE_MISC_TILED is specified on a resource, the tiles that make up the resource come from pointing at locations in a Tile Pool. A Tile Pool is a pool of memory (backed by one or more allocations behind the scenes - unseen by the application) that is simple for the operating system / driver to manage and whose memory footprint is easily understood by an application. Tiled Resources map 64KB regions by pointing to locations in a Tile Pool. One consequence of this setup is that it allows multiple Resources to share/reuse the same tiles, and also allows the same tiles to be reused at different locations within a Resource if desired.

    The cost for the flexibility of populating the tiles for a Resource out of a Tile Pool is that the Resource has to do the work of defining and maintaining the mapping of which tiles in the Tile Pool represent the tiles needed for the Resource. Tile mappings can be changed. Also, not all tiles in a Resource need to be mapped at a time; it is a feature to be able to have NULL mappings - that is the definition of a tile not being available from the point of view of the Resource accessing it.

    Multiple Tile Pools can be created, and any number of Tiled Resources can map into any given Tile Pool at the same time. Tile Pools can also be grown or shrunk (see Resizing Tile Pools(5.9.2.2.2) for details). One constraint, existing merely to simplify driver and runtime implementation, is that a given Tiled Resource may only have mappings into at most one Tile Pool at a time (as opposed to having simultaneous mappings to multiple Tile Pools).

    The amount of storage associated with a Tiled Resource itself (independent of Tile Pool memory) should be roughly proportional to the number of tiles actually mapped to the pool at any given time. In hardware this boils down to scaling the memory footprint for page table storage roughly with the number of tiles that are mapped (e.g. using a multilevel page table scheme as appropriate).

    The Tile Pool can be thought of as an entirely software abstraction that enables D3D applications to effectively be able to program the page tables on the GPU without having to know the low level implementation details (or deal with pointer addresses directly). Tile Pools do not apply any additional levels of indirection in hardware. Optimizations of a single level page table using constructs like page directories are independent of the Tile Pool concept.

    Let us explore what storage the page table itself could require in the worst case (though in practice implementations should only require storage roughly proportional to what is mapped).

    Suppose each page table entry is 64 bits.

    For the worst-case page table size hit for a single surface, given the resource limits in D3D11, suppose a Tiled Resource is created with a 128 bit-per-element format (e.g. RGBA float), so a 64KB tile contains only 4096 pixels. The maximum supported Texture2DArray size of 16384*16384*2048 (but with only a single mipmap) would require about 1GB of storage in the page table if fully populated (not including mipmaps) using 64 bit table entries. Adding mipmaps would grow the fully-mapped (worst case) page table storage by about a third, to about 1.3GB.

    This would give access to about 10.6 terabytes of addressable memory (including mipmaps). There may well be a limit on the amount of addressable memory however, which would reduce these amounts, perhaps to around the terabyte range.
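
    The arithmetic behind this worst-case estimate is straightforward (non-normative, C++-style):

    // Worst-case page table arithmetic for the example above (128bpp, 64 bit entries).
    UINT64 pixelsPerTile  = 65536 / 16;                         // 4096 pixels per 64KB tile
    UINT64 tilesPerSlice  = (16384ULL * 16384) / pixelsPerTile; // 65536 tiles
    UINT64 totalTiles     = tilesPerSlice * 2048;               // 134,217,728 tiles (no mips)
    UINT64 pageTableBytes = totalTiles * 8;                     // 1GB of page table entries
    UINT64 addressable    = totalTiles * 65536;                 // 8TB; ~10.6TB adding mipmaps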

    Another case to consider is a single Texture2D Tiled Resource of 16384*16384 with a 32 bit-per-element format, including mipmaps. The space needed in a fully populated page table would be roughly 170KB with 64 bit table entries.

    Finally, consider an example using a BC format, say BC7 with 128 bits per 4x4 block of pixels. That is one byte per pixel. A Texture2DArray of 16384*16384*2048 including mipmaps would require roughly 85MB to fully populate this memory in a page table. That is not bad considering this allows one Tiled Resource to span 550 gigapixels (512 GB of memory in this case).

    In practice nowhere near these full mappings would be defined given that the amount of physical memory available wouldn't allow anywhere near that much to be mapped and referenced at a time anyway. With a tile pool, however, applications could choose to reuse tiles (as a simple example, reusing a "black" colored tile for large black regions in an image) - effectively using the Tile Pool (i.e. page table mappings) as a tool for memory compression.

    The initial contents of the page table are NULL for all entries. Applications also can't pass initial data for the memory contents of the surface since it starts off with no memory backing.


    5.9.2.2.1 Tile Pool Creation

    Applications can create one or more Tile Pools per D3D device. The total size of a given Tile Pool is restricted to D3D11's resource size limit, which is roughly 1/4 of GPU RAM.

    A Tile Pool is made of 64KB tiles, but the operating system (driver) manages the entire pool as one or more allocations behind the scenes - the breakdown is not visible to applications. Tiled Resources define content by pointing at tiles within a Tile Pool. Unmapping a tile from a Tiled Resource is done simply by pointing it to NULL. Such unmapped tiles have rules about the behavior of reads or writes (defined later).

    A Tile Pool is created via the CreateBuffer API using a flag to indicate it is a tile pool.
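
    A minimal creation call might look as follows (non-normative sketch; the 16MB size is arbitrary, and error handling is omitted):

    D3D11_BUFFER_DESC poolDesc = {};
    poolDesc.ByteWidth = 16 * 1024 * 1024;              // must be a multiple of 64KB (0 is also valid)
    poolDesc.Usage = D3D11_USAGE_DEFAULT;               // only DEFAULT is allowed for Tile Pools
    poolDesc.MiscFlags = D3D11_RESOURCE_MISC_TILE_POOL;
    
    ID3D11Buffer* pTilePool = NULL;
    HRESULT hr = pDevice->CreateBuffer(&poolDesc, NULL, &pTilePool);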


    5.9.2.2.2 Tile Pool Resizing

    A ResizeTilePool()(5.9.3.4) API allows a Tile Pool to be grown if the application needs more working set for the Tiled Resource(s) mapping into it, or shrunk if less space is needed. Another option for applications is to allocate additional Tile Pools for new Tiled Resources; however, if any single Tiled Resource needs more space than initially available in its Tile Pool, growing the Tile Pool is a good option. A Tiled Resource can't have mappings into multiple Tile Pools at once.

    When a Tile Pool is grown, additional Tiles are added to the end via one or more new allocations by the driver (breakdown into allocations not visible to the application). Existing memory in the Tile Pool is left untouched and existing Tiled Resource mappings into that memory remain intact.

    When a Tile Pool is shrunk, tiles are removed from the end (this is allowed even below the initial allocation size, down to 0), meaning new mappings cannot be made past the new size. Existing mappings past the end of the new size, however, remain intact and usable, and drivers will keep the memory around as long as mappings to any part of the allocation(s) backing the Tile Pool remain. If, after shrinking, some memory has been kept alive because Tile Mappings still point to it and the Tile Pool is then regrown (by any amount), the existing memory is reused first before any additional allocations occur to service the size of the grow operation.

    To be able to save memory, an application has to not only shrink a Tile Pool but also remove/remap existing mappings past the end of the new smaller Tile Pool size.

    The act of shrinking (and removing mappings) doesn't necessarily produce immediate memory savings. Freeing of memory depends on how granular the driver's underlying allocations for the Tile Pool are - when shrinking happens to be enough to make a driver allocation unused, the driver can free it. If a Tile Pool was grown, shrinking back to previous sizes (and removing/remapping tile mappings correspondingly) will most likely yield memory savings, though this is not guaranteed if the sizes don't exactly align with the underlying allocation sizes chosen by the driver.
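
    Growing and shrinking both go through the same call; e.g. (non-normative sketch, sizes arbitrary):

    // Grow the Tile Pool to 32MB - new tiles are appended at the end.
    pDeviceContext2->ResizeTilePool(pTilePool, 32 * 1024 * 1024);
    
    // Later shrink it to 8MB - tiles are removed from the end, but memory is only
    // actually freed once mappings past the new end are removed/remapped and the
    // underlying driver allocation(s) become unused.
    pDeviceContext2->ResizeTilePool(pTilePool, 8 * 1024 * 1024);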


    5.9.2.2.3 Hazard Tracking vs. Tile Pool Resources

    For non-Tiled Resources, D3D is able to prevent certain hazard conditions during rendering. For example, the D3D runtime does not allow any given SubResource to be bound as an input (such as a ShaderResourceView) and as an output (such as a RenderTargetView) at the same time. If such a case is encountered, the runtime unbinds the input. This tracking overhead in the runtime is cheap and is done at the SubResource level. One of the benefits of this is to minimize the chances of applications accidentally depending on hardware shader execution order - something that, even if consistent on a given GPU, would certainly vary across different GPUs.

    It may, however, be too expensive to do the similar per-tile hazard tracking that would be necessary for Tiled Resources. New issues arise, such as possibly validating away attempts to render to an RTV with one tile mapped to multiple areas in the surface simultaneously. If it turns out this per-tile hazard tracking is too expensive for the D3D runtime, ideally this would at least be an option in the Debug Layer.

    Applications are required to inform the driver when they have issued a write or read to a Tiled Resource that references Tile Pool memory that will also be referenced by separate Tiled Resources in upcoming read or write operations, where the first operations are expected to complete before the second set can begin. See the TiledResourceBarrier()(5.9.3.5) command.
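
    For example (non-normative sketch; the two resources are assumed to alias the same Tile Pool memory):

    // Ensure writes issued through pTiledResourceA complete before subsequently
    // submitted operations read the shared memory through pTiledResourceB.
    pDeviceContext2->TiledResourceBarrier(pTiledResourceA, pTiledResourceB);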


    5.9.2.3 Tiled Resource Creation Parameters

    There are some constraints on the type of D3D resources allowed to be created with the D3D11_RESOURCE_MISC_TILED flag. The valid parameters are:

    Supported Resource Type: Texture2D[Array] (incl. TextureCube[Array], which is a variant of Texture2D[Array]), Buffer (not Texture1D[Array] or Texture3D - Texture3D expected for future).

    Supported Resource Usage: D3D11_USAGE_DEFAULT (not: _DYNAMIC, _STAGING or _IMMUTABLE).

    Supported Resource Misc Flags: D3D11_RESOURCE_MISC_TILED (by definition), _MISC_TEXTURECUBE, _DRAWINDIRECT_ARGS, _BUFFER_ALLOW_RAW_VIEWS, _BUFFER_STRUCTURED, _RESOURCE_CLAMP, _GENERATE_MIPS (not: _SHARED, _SHARED_KEYEDMUTEX, _GDI_COMPATIBLE, _SHARED_NTHANDLE, _RESTRICTED_CONTENT, _RESTRICT_SHARED_RESOURCE, _RESTRICT_SHARED_RESOURCE_DRIVER, _GUARDED, _TILE_POOL)

    Supported Bind Flags: D3D11_BIND_SHADER_RESOURCE, _RENDER_TARGET, _DEPTH_STENCIL, _UNORDERED_ACCESS (not _CONSTANT_BUFFER, _VERTEX_BUFFER [note that binding a tiled Buffer as an SRV/UAV/RTV is still ok], _INDEX_BUFFER, _STREAM_OUTPUT, _BIND_DECODER, _BIND_VIDEO_ENCODER)

    Supported Formats: All formats that would be available for the given configuration regardless of it being tiled, with some exceptions detailed elsewhere.

    Supported SampleDesc (Multisample count, quality): Whatever would be supported for the given configuration regardless of it being tiled, with some exceptions detailed elsewhere.

    Supported Width/Height/MipLevels/ArraySize: Full extents supported by D3D11. Tiled Resources do not have the restriction on total memory size imposed on non-Tiled Resources - they are only constrained by overall Virtual Address Space limits(5.9.2.3.1).

    The initial contents of Tile Pool memory are undefined.


    5.9.2.3.1 Address Space Available for Tiled Resources

    On 64 bit OSs, at least 40 bits of virtual address space (1 Terabyte) is available.

    For 32 bit OSs, the address space is 32 bit. For 32 bit ARM systems, individual Tiled Resource creation can fail if the allocation would use more than 27 bits of address space (128 MB). This includes any hidden padding in the address space the hardware may use for mipmaps, packed tile padding, and possibly padding surface dimensions to powers of 2.

    On systems with a separate page table for the GPU, most of this address space will be available to GPU resources made by the application, though GPU allocations made by the driver fit in the same space.

    On future systems with a page table shared between the CPU and GPU, the available address space is shared between all CPU and GPU allocations in a process.


    5.9.2.4 Tile Pool Creation Parameters

    Tile Pools are defined by the following application specified properties (via the CreateBuffer API):

    Size: Allocation size, as a multiple of 64KB (0 is valid since there is a Resize operation available).

    Supported Resource Misc Flags: D3D11_RESOURCE_MISC_TILE_POOL (identifies it is a tile pool), D3D11_RESOURCE_MISC_SHARED, _SHARED_KEYEDMUTEX, _SHARED_NTHANDLE

    Supported Resource Usage: D3D11_USAGE_DEFAULT only.


    5.9.2.5 Tiled Resource Cross Process / Device Sharing

    Tile Pools can be shared with other processes just like traditional resources. Tiled Resources (which reference Tile Pools) cannot be shared across devices/processes. However separate processes can create their own Tiled Resources that map to Tile Pool(s) shared between them.

    Shared Tile Pools cannot be resized.


    5.9.2.5.1 Stencil Formats Not Supported with Tiled Resources

    Formats containing stencil are not supported with Tiled Resources.

    This includes DXGI_FORMAT_D24_UNORM_S8_UINT (and related formats in the R24G8 family) and DXGI_FORMAT_D32_FLOAT_S8X24_UINT (and related formats in the R32G8X24 family).

    Some implementations store depth and stencil in separate allocations while others store them together. The problem is that tile management for the two schemes would have to be different, and effort has not gone into coming up with a way to abstract or rationalize the differences in a single API. A recommendation for future hardware is to support independent depth and stencil surfaces, each independently tiled. 32 bit depth would have 128x128 tiles and 8 bit stencil would have 256x256 tiles, so applications would have to live with tile shape misalignment between depth and stencil, but the same problem exists with different RenderTarget surface formats already.


    5.9.2.6 Operations Available on Tiled Resource

    Tile controls are available on immediate or deferred contexts (just like updates to normal Resources) and upon execution impact subsequent accesses to the tiles (not previously submitted operations).


    5.9.2.7 Operations Available on Tile Pools

    Data cannot be copied to/from Tile Pool memory directly. Accesses to the memory are always done through Tiled Resources.


    5.9.2.8 How a Tiled Resource's Area is Tiled

    When a Tiled Resource is created, the dimensions, format element size and number of mipmaps and/or array slices (if applicable) determine the number of tiles that would be required to back the entire surface area. The pixel/byte layout within tiles is implementation-chosen (until such time as a standard layout is defined for future hardware). The number of pixels that fit in a tile, depending on the format element size, is fixed and identical whether using a (future) standard swizzle or not.

    This means that the number of tiles that will be used by a given surface size and format element width is well defined/predictable based on the following tables. For Resources that contain mipmaps, or cases where surface dimensions don't fill a tile, however, there are some constraints, discussed later(5.9.2.8.5).

    Different Tiled Resources can point to the same memory with different formats as long as applications don't rely on the results of writing to the memory with one format and reading with another, unless the formats are in the same format family (have the same typeless parent format) - e.g. R8G8B8A8_UNORM and R8G8B8A8_UINT are compatible with each other but not with R16G16_UNORM. There is one exception where bleeding data from one format aliasing to another is well defined: a tile that contains 0 for all its bits can be used with any format that interprets those memory contents as 0 (regardless of memory layout). So a tile could be cleared to 0x00 with the format R8_UNORM and then used with a format like R32G32_FLOAT, and the contents would still appear to be (0.0f,0.0f).

    The layout of data within a tile does not depend on where the tile is mapped in a resource overall. So, for example, a tile can be reused in different locations of a surface at once with consistent behavior in all locations.


    5.9.2.8.1 Texture1D[Array] Subresource Tiling - Designed But Not Supported

    (not counting tail mip packing)

    Texture1D[Array] Tiled Resource support was designed as follows but not exposed for lack of utility.

    Bits/Pixel Tile Dimensions (Pixels)
    8 65536
    16 32768
    32 16384
    64 8192
    128 4096
    BC1,4 Not supported
    BC2,3,5,6,7 Not supported

    Other format bit counts not supported with Tiled Resources: 96bpp formats, video formats, R1_UNORM, R8G8_B8G8_UNORM, G8R8_G8B8_UNORM.


    5.9.2.8.2 Texture2D[Array] Subresource Tiling

    (not counting tail mip packing)

    Bits/Pixel (1 sample/pixel) Tile Dimensions (Pixels, WxH)
    8 256x256
    16 256x128
    32 128x128
    64 128x64
    128 64x64
    BC1,4 512x256
    BC2,3,5,6,7 256x256

    Other format bit counts not supported with Tiled Resources: 96bpp formats, video formats, R1_UNORM, R8G8_B8G8_UNORM, G8R8_G8B8_UNORM.

    Multisample Count Divide Tile Dimensions Above by (WxH)
    1 1x1
    2 2x1
    4 2x2
    8 4x2
    16 4x4

    Only sample counts 1 and 4 are required (and allowed) to be supported with Tiled Resources. 2, 8, and 16 are shown for future consideration.

    Implementations may choose to support 2, 8, and/or 16 sample MSAA for NON-Tiled Resources even though Tiled Resources don't support them.

    Tiled Resources with sample counts larger than 1 cannot use 128bpp formats.

    The constraints on supported sample counts and formats are due to hardware inconsistencies from the desired spec at the time of design.
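
    Note that each tile shape in the table above holds exactly 64KB; e.g. (non-normative check, C++-style):

    // Each Texture2D tile shape is exactly one 64KB tile:
    static_assert(256 * 256 * 1  == 65536, "8bpp tile");
    static_assert(256 * 128 * 2  == 65536, "16bpp tile");
    static_assert(128 * 128 * 4  == 65536, "32bpp tile");
    static_assert(128 *  64 * 8  == 65536, "64bpp tile");
    static_assert( 64 *  64 * 16 == 65536, "128bpp tile");
    // BC1 packs each 4x4 pixel block into 8 bytes: (512/4) * (256/4) * 8 == 65536.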


    5.9.2.8.3 Texture3D Subresource Tiling

    (not counting tail mip packing)

    This roughly takes the Texture2D tiling, divides the x/y dimensions by 4 each, and adds 16 layers of depth. All the tiles for the first plane (the 2D plane of tiles defining the first 16 layers of depth) appear before the subsequent planes:

    Texture3D support in Tiled Resources is not exposed in the initial implementation of Tiled Resource, but the desired tile shapes are listed here for consideration in a future release.

    Bits/Pixel (1 sample/pixel) Tile Dimensions (Pixels, WxHxD)
    8 64x32x32
    16 32x32x32
    32 32x32x16
    64 32x16x16
    128 16x16x16
    BC1,4 128x64x16
    BC2,3,5,6,7 64x64x16

    Other format bit counts not supported with Tiled Resources: 96bpp formats, video formats, R1_UNORM, R8G8_B8G8_UNORM, G8R8_G8B8_UNORM.


    5.9.2.8.4 Buffer Tiling

    A Buffer Resource is trivially divided into 64KB tiles, with some empty space in the last tile if the size is not a multiple of 64KB.
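
    So the tile count for a Buffer follows directly from its size, e.g. (non-normative):

    // Number of 64KB tiles backing a Buffer of ByteWidth bytes (rounding up).
    UINT NumTiles = (UINT)((ByteWidth + 65535) / 65536); // e.g. 200000 bytes -> 4 tiles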

    Structured Buffers have no constraint on the Stride to be Tiled; however, possible performance optimizations in hardware for using Structured Buffers may be sacrificed by making them Tiled in the first place.


    5.9.2.8.5 Mipmap Packing

    Depending on the Tier(5.9.7) of Tiled Resources support, mipmaps with certain dimensions do not follow the standard tile shapes and are considered to all be packed together with one another in a manner that is opaque to the application. Higher Tiers of support have broader guarantees about what types of surface dimensions fit in the standard tile shapes (and can therefore be individually mapped by applications).

    What can vary between implementations is that - given a Tiled Resource's dimensions, format, number of mipmaps and array slices - some number M of mips (per array slice) may be packed into some number N of tiles. The GetResourceTiling()(5.9.3.2) API exists to allow the driver to report to the application what M and N are (among other details about the surface that this API reports, which are standard and do not vary by IHV). The tiles for the packed mips are still 64KB and can be individually mapped into disparate locations in a Tile Pool; however, the pixel shape of the tiles and how the mipmaps fit across the set of tiles is IHV specific and too complex to expose. So applications are required to either map all of the tiles that are designated as packed, or none of them, at a time. Otherwise the behavior for accessing the Tiled Resource is undefined.

    For arrayed surfaces, the set of packed mips and the number of packed tiles storing those mips (M and N described above) applies individually for each array slice.
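
    As a non-normative illustration using the GetResourceTiling() API and structs described later in this section, an application might discover M and N as follows:

    // Query the packed mip count (M) and the tiles for the packed mips per slice (N).
    D3D11_PACKED_MIP_DESC packedMipDesc;
    D3D11_TILE_SHAPE tileShape;
    D3D11_SUBRESOURCE_TILING subresourceTiling;
    UINT numTilesForEntireResource;
    UINT numSubresourceTilings = 1; // retrieve tiling info for subresource 0 only
    pDevice2->GetResourceTiling(pTiledResource, &numTilesForEntireResource,
                                &packedMipDesc, &tileShape,
                                &numSubresourceTilings, 0, &subresourceTiling);
    UINT M = packedMipDesc.NumPackedMips;         // mips that must be mapped/unmapped together
    UINT N = packedMipDesc.NumTilesForPackedMips; // tiles backing those mips, per array slice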

    The dedicated APIs for Copying Tiles(5.9.3.3) cannot access packed mips. Applications that wish to copy data to/from packed mips can do so using all the non-Tiled Resource specific APIs for copying and rendering to surfaces.

    For the purposes of populating the contents of mipmapped Tiled Resources for mips that are not packed (i.e. use the standard tile shapes) from CPU memory (e.g. Staging memory or user data pointers), there is a well defined CPU-side layout for the tiling of all mipmaps independent of implementation (described in the Copying Tiles(5.9.3.3) section). Implementations can hide any differences in tile breakdown of mipmaps on the GPU side during Copy operations.


    5.9.3 Tiled Resource APIs


    5.9.3.1 Assigning Tiles from a Tile Pool to a Resource

    The following APIs allow manipulation and querying of tile mappings. Update calls only affect the tiles identified in the call, and others are left as defined previously.

    Any given tile from a Tile Pool can be mapped to multiple locations in a Resource and even multiple Resources. This includes tiles in a Resource that have an implementation chosen layout, described earlier, where multiple mipmaps are packed together into a single tile. The catch is that if data is written to the tile via one mapping, but read via a differently configured mapping, the results are undefined. Careful use of this flexibility can still be useful for an application though, like sharing a tile between resources that will not be used simultaneously, where the contents of the tile are always initialized through the same Resource mapping as they will be subsequently read from. Similarly a tile mapped to hold the packed mipmaps of multiple different Resources with the same surface dimensions will work fine - the data will appear the same in both mappings.

    Changes to tile assignments for a Resource can be made at any time in an immediate or deferred context.

    // --------------------------------------------------------------------------------------------------------------------------------
    // Data Structures for Manipulating Tile Mappings
    // --------------------------------------------------------------------------------------------------------------------------------
    
    // For manipulating tile mappings, regions in tiled resources are described by a combination of:
    // (1) tiled resource coordinate (defining the corner of a region) and
    // (2) tile region size (defining the size of a region)
    //
    // These are separated into two structs rather than one so that the various APIs
    // that use them can use different combinations of the parts.
    
    typedef struct D3D11_TILED_RESOURCE_COORDINATE
    {
        // Coordinate values below index tiles (not pixels or bytes).
        UINT X; // Used for buffer, 1D, 2D, 3D
        UINT Y; // Used for 2D, 3D
        UINT Z; // Used for 3D
        UINT Subresource; // indexes into mips, arrays. Used for 1D, 2D, 3D
        // For mipmaps that use nonstandard tiling and/or are packed, any subresource
        // value that indicates any of the packed mips all refer to the same tile.
    } D3D11_TILED_RESOURCE_COORDINATE;
    
    typedef struct D3D11_TILE_REGION_SIZE
    {
        UINT NumTiles;
        BOOL bUseBox; // TRUE: Uses width/height/depth parameters below to define the region.
                      //   width*height*depth must match NumTiles above.  (While
                      //   this looks like redundant information, the application likely has to know
                      //   how many tiles are involved anyway.)
                      //   The downside to using the box parameters is that one update region cannot
                      //   span mipmaps (though it can span array slices via the depth parameter).
                      //
                      // FALSE: Ignores width/height/depth parameters - NumTiles just traverses tiles in
                      //   the resource linearly across x, then y, then z (as applicable) then spilling over
                      //   mips/arrays in subresource order.  Useful for just mapping an entire resource
                      //   at once, for example.
                      //
                      // In either case, the starting location for the region within the resource
                      // is specified as a separate parameter outside this struct, using x,y,z coordinates
                      // regardless of whether bUseBox above is TRUE or FALSE.
                      //
                      // When the region includes mipmaps that are packed with nonstandard tiling,
                      // bUseBox must be FALSE, since tile dimensions are not standard and the application
                      // only knows a count of how many tiles are consumed by the packed area (which is per
                      // array slice).  The corresponding (separate) starting location parameter uses x to
                      // offset into the flat range of tiles in this case, and y,z coordinates must be 0.
    
        UINT Width;   // In tiles, used for buffer, 1D, 2D, 3D
        UINT16 Height; // In tiles, used for 2D, 3D
        UINT16 Depth; // In tiles, used for 3D or arrays.  For arrays, advancing in depth jumps to next slice
                      // of same mip size, which is not contiguous in the subresource counting space
                      // if there are multiple mips.
    } D3D11_TILE_REGION_SIZE;
    
    typedef enum D3D11_TILE_MAPPING_FLAG
    {
        D3D11_TILE_MAPPING_NO_OVERWRITE = 0x00000001,
    } D3D11_TILE_MAPPING_FLAG;
    
    typedef enum D3D11_TILE_RANGE_FLAG
    {
        D3D11_TILE_RANGE_NULL = 0x00000001,
        D3D11_TILE_RANGE_SKIP = 0x00000002,
        D3D11_TILE_RANGE_REUSE_SINGLE_TILE = 0x00000004,
    } D3D11_TILE_RANGE_FLAG;
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // UpdateTileMappings
    // --------------------------------------------------------------------------------------------------------------------------------
    // UpdateTileMappings adds/removes/changes mappings of tile locations in Tiled Resources to memory locations in a Tile Pool.
    // The API has several modes of operation to enable a few common tasks to be efficiently described.
    //
    // The basic organization of the parameters is as follows:
    //
    //      (1) Tiled Resource whose mappings are being updated
    //      (2) Set of Tile Regions on the Tiled Resource whose mappings to update.
    //      (3) Tile Pool providing memory where tile mappings can go.
    //      (4) Set of Tile Ranges where mappings are going: to the Tile Pool in (3), to NULL, and/or other options.
    //      (5) Flags parameter for overall options
    //
    // More detailed breakdown of the parameters:
    //
    // (1) Tiled Resource whose mappings are being updated - resource created with the D3D11_RESOURCE_MISC_TILED flag.
    //     Mappings start off all NULL when a resource is initially created.
    //
    // (2) Set of Tile Regions on the Tiled Resource whose mappings to update.  One API call can update many mappings,
    //     but an application can make multiple calls as well if that is more convenient (with a bit more API call overhead).
    //     NumTiledResourceRegions specifies how many regions there are, pTiledResourceRegionStartCoordinates and
    //     pTiledResourceRegionSizes are each arrays identifying the start location and extent of each region.
    //     If NumTiledResourceRegions is 1, then for convenience either or both of the arrays describing the regions can
    //     be NULL.  NULL for pTiledResourceRegionStartCoordinates means the start coordinate is all 0's, and NULL for
    //     pTiledResourceRegionSizes identifies a default region that is the full set of tiles for the entire Tiled Resource,
    //     including all mipmaps and/or array slices.
    //
    //     If pTiledResourceRegionStartCoordinates is not NULL and pTiledResourceRegionSizes is NULL, then the region
    //     size defaults to 1 tile for all regions.  This makes it easy to define mappings for a set of individual tiles
    //     each at disparate locations by providing an array of locations in pTiledResourceRegionStartCoordinates without
    //     having to send an array of pTiledResourceRegionSizes all set to 1.
    //
    //     The updates are applied from first region to last, so if regions
    //     overlap in a single call, the updates later in the list overwrite the areas overlapping with previous updates.
    //
    // (3) Tile Pool providing memory where mappings are pointing to.  A Tiled Resource can point to a single Tile Pool
    //     at a time.  If a new Tile Pool is specified (for the first time or different
    //     from the last time a Tile Pool was specified), all existing tile mappings for the Tiled Resource are cleared
    //     and the new set of mappings in the current call are applied for the new Tile Pool.
    //     If no Tile Pool is specified (NULL), or the same one as a previous call to UpdateTileMappings is provided,
    //     the call just adds the new mappings to existing ones (overwriting on overlap).
    //     If the call is only defining NULL mappings, no Tile Pool needs to be specified, since it doesn't matter.
    //     But if one is specified anyway it takes the same behavior as described above when providing a Tile Pool.
    //
    // (4) Set of Tile Ranges where mappings are going to.  Each given Tile Range can specify one of a few types of
    //     ranges: a range of tiles in a Tile Pool (default), a count of tiles in the Tiled Resource to map
    //     to a single tile in a Tile Pool (sharing the tile), a count of tile mappings in the Tiled Resource to skip
    //     and leave as they are, or a count of tiles in the Tiled Resource to map to NULL.
    //
    //     NumRanges specifies the number of Tile Ranges, where the total tiles identified across all ranges
    //     must match the total number of tiles in the Tile Regions from the Tiled Resource described above.
    //     Mappings are defined by iterating through the tiles in the Tile Regions in sequential order - x then y
    //     then z order for box regions - while walking through the set of Tile Ranges in sequential order.
    //     The breakdown of Tile Regions doesn't have to line up with the breakdown of Tile Ranges
    //     - all that matters is the total number of tiles on both sides is equal so that each Tiled Resource tile
    //     specified has a mapping specified.
    //
    //     pRangeFlags, pTilePoolStartOffsets and pRangeTileCounts are all arrays, of size NumRanges, describing the Tile
    //     Ranges.  If pRangeFlags is NULL, all ranges are sequential tiles in the Tile Pool, otherwise for each range i
    //     pRangeFlags[i] identifies how the mappings in that range of tiles work:
    //
    //     If pRangeFlags[i] is 0, that range defines sequential tiles in the Tile Pool, with the number of tiles being
    //     pRangeTileCounts[i] and the starting location pTilePoolStartOffsets[i].  If NumRanges is 1, pRangeTileCounts
    //     can be NULL and defaults to the total number of tiles specified by all the Tile Regions.
    //
    //     If pRangeFlags[i] is D3D11_TILE_RANGE_REUSE_SINGLE_TILE, pTilePoolStartOffsets[i] identifies the single
    //     tile in the Tile Pool to map to, and pRangeTileCounts[i] specifies how many tiles from the Tile Regions to
    //     map to that Tile Pool location.  If NumRanges is 1, pRangeTileCounts can be NULL and defaults to the total
    //     number of tiles specified by all the Tile Regions.
    //
    //     If pRangeFlags[i] is D3D11_TILE_RANGE_NULL, pRangeTileCounts[i] specifies how many tiles from the Tile Regions
    //     to map to NULL.  If NumRanges is 1, pRangeTileCounts can be NULL and defaults to the total
    //     number of tiles specified by all the Tile Regions. pTilePoolStartOffsets[i] is ignored for NULL mappings.
    //
    //     If pRangeFlags[i] is D3D11_TILE_RANGE_SKIP, pRangeTileCounts[i] specifies how many tiles from the Tile Regions
    //     to skip over and leave existing mappings unchanged for.  This can be useful if a Tile Region conveniently
    //     bounds an area of Tile Mappings to update except with some exceptions that need to be left the same as
    //     whatever they were mapped to before. pTilePoolStartOffsets[i] is ignored for SKIP mappings.
    //
    //  (5) Flags: D3D11_TILE_MAPPING_NO_OVERWRITE means the caller promises that previously submitted commands to the
    //      device that may still be executing do not reference any of the tile region being updated.
    //      This allows the device to avoid having to flush previously submitted work in order to do the tile mapping
    //      update.  If the application violates this promise by updating tile mappings for locations in Tiled Resources
    //      still being referenced by outstanding commands, undefined rendering behavior results, including the potential
    //      for significant slowdowns on some architectures.  This is like the "no overwrite" concept that exists
    //      elsewhere in the API, except applied to the Tile Mapping data structure itself (which in hardware is a page table).
    //      The absence of this flag requires that tile mapping updates specified by this call must be completed before any
    //      subsequent D3D command can proceed.
    //
    // Return values:
    //
    // Returns S_OK, E_INVALIDARG, E_OUTOFMEMORY or DXGI_ERROR_DEVICE_REMOVED.  E_OUTOFMEMORY can happen if the call results
    // in the driver having to allocate space for new page table mappings but running out of memory.
    //
    // If out of memory occurs when this is called in a CommandList and the CommandList is being executed, the device will be removed.
    // Applications can avoid this situation by only doing update calls that change existing mappings from Tiled Resources
    // within commandlists (so drivers will not have to allocate page table memory, only change the mapping).
    //
    // Validation remarks:
    //
    // The tile regions specified must entirely fit in the tiled resource or behavior is undefined (debug layer will emit an error).
    // The number of tiles in the tile regions must match the number of tiles in all the tile ranges otherwise the
    // call is dropped with E_INVALIDARG.  Other parameter errors also result in the call being dropped with E_INVALIDARG - the
    // debug layer provides explanations.
    //
    
    HRESULT
    ID3D11DeviceContext2::
    UpdateTileMappings( _In_ ID3D11Resource* pTiledResource,
                        _In_ UINT NumTiledResourceRegions,
                        _In_reads_opt_(NumTiledResourceRegions) const D3D11_TILED_RESOURCE_COORDINATE* pTiledResourceRegionStartCoordinates,
                        _In_reads_opt_(NumTiledResourceRegions) const D3D11_TILE_REGION_SIZE* pTiledResourceRegionSizes,
                        _In_opt_ ID3D11Buffer* pTilePool,
                        _In_ UINT NumRanges,
                        _In_reads_opt_(NumRanges) const UINT* pRangeFlags,
                        _In_reads_opt_(NumRanges) const UINT* pTilePoolStartOffsets,  // 0 based tile offsets
                                                                                      // counting in tiles (not bytes)
                        _In_reads_opt_(NumRanges) const UINT* pRangeTileCounts,
                        _In_ UINT Flags
                        );
    
    // ----------------------------------------------------------
    // Here are some examples of common UpdateTileMappings cases:
    // ----------------------------------------------------------
    //
    // ----------------------------------------------
    // Clearing an entire surface's mappings to NULL:
    // ----------------------------------------------
    // - No-overwrite is specified, assuming it is known nothing else the GPU could be doing is referencing the previous mappings
    // - NULL for pTiledResourceRegionStartCoordinates and pTiledResourceRegionSizes defaults to the entire resource
    // - NULL for pTilePoolStartOffsets since it isn't needed for mapping tiles to NULL
    // - NULL for pRangeTileCounts when NumRanges is 1 defaults to the same number of tiles as the tiled resource region (which is
    //   the entire surface in this case)
    //
    // UINT RangeFlags = D3D11_TILE_RANGE_NULL;
    // pDeviceContext2->UpdateTileMappings(pTiledResource,1,NULL,NULL,NULL,1,&RangeFlags,NULL,NULL,D3D11_TILE_MAPPING_NO_OVERWRITE);
    //
    // -------------------------------------------
    // Mapping a region of tiles to a single tile:
    // -------------------------------------------
    // - This maps a 2x3 tile region at tile offset (1,1) in a Tiled Resource to tile [12] in a Tile Pool
    //
    // D3D11_TILED_RESOURCE_COORDINATE TRC;
    // TRC.X = 1;
    // TRC.Y = 1;
    // TRC.Z = 0;
    // TRC.Subresource = 0;
    //
    // D3D11_TILE_REGION_SIZE TRS;
    // TRS.bUseBox = TRUE;
    // TRS.Width = 2;
    // TRS.Height = 3;
    // TRS.Depth = 1;
    // TRS.NumTiles = TRS.Width * TRS.Height * TRS.Depth;
    //
    // UINT RangeFlags = D3D11_TILE_RANGE_REUSE_SINGLE_TILE;
    // UINT StartOffset = 12;
    // pDeviceContext2->UpdateTileMappings(pTiledResource,1,&TRC,&TRS,pTilePool,1,&RangeFlags,&StartOffset,
    //                                     NULL,D3D11_TILE_MAPPING_NO_OVERWRITE);
    //
    // ----------------------------------------------------------
    // Defining mappings for a set of disjoint individual tiles:
    // ----------------------------------------------------------
    // - This can also be accomplished in multiple calls.  Using a single call to define multiple
    //   mapping updates can reduce CPU call overhead slightly,
    //   at the cost of having to pass arrays as parameters.
    // - Passing NULL for pTiledResourceRegionSizes defaults to each region in the Tiled Resource
    //   being a single tile.  So all that is needed are the coordinates of each one.
    // - Passing NULL for Range Flags defaults to no flags (since none are needed in this case)
    // - Passing NULL for pRangeTileCounts defaults to each range in the Tile Pool being size 1.
    //   So all that is needed are the start offsets for each tile in the Tile Pool
    //
    // D3D11_TILED_RESOURCE_COORDINATE TRC[3];
    // UINT StartOffsets[3];
    // UINT NumSingleTiles = 3;
    //
    // TRC[0].X = 1;
    // TRC[0].Y = 1;
    // TRC[0].Z = 0;
    // TRC[0].Subresource = 0;
    // StartOffsets[0] = 1;
    //
    // TRC[1].X = 4;
    // TRC[1].Y = 7;
    // TRC[1].Z = 0;
    // TRC[1].Subresource = 0;
    // StartOffsets[1] = 4;
    //
    // TRC[2].X = 2;
    // TRC[2].Y = 3;
    // TRC[2].Z = 0;
    // TRC[2].Subresource = 0;
    // StartOffsets[2] = 7;
    //
    // pDeviceContext2->UpdateTileMappings(pTiledResource,NumSingleTiles,TRC,NULL,pTilePool,NumSingleTiles,NULL,StartOffsets,NULL,D3D11_TILE_MAPPING_NO_OVERWRITE);
    //
    // -----------------------------------------------------------------------------------
    // Complex example - defining mappings for regions with some skips, some NULL mappings
    // -----------------------------------------------------------------------------------
    // - This complex example hard codes the parameter arrays, whereas in practice the
    //   application would likely configure the parameters programmatically or in a data driven way.
    // - Suppose we have 3 regions in a Tiled Resource to configure mappings for, 2x3 at coordinate (1,1),
    //   3x3 at coordinate (4,7), and 7x1 at coordinate (20,30)
    // - The tiles in the regions are walked from first to last, in X then Y then Z order,
    //   while stepping forward through the specified Tile Ranges to determine each mapping.
    //   In this example, 22 tile mappings need to be defined.
    // - Suppose we want the first 3 tiles to be mapped to a contiguous range in the Tile Pool starting at
    //   tile pool location [9], the next 8 to be skipped (left unchanged), the next 2 to map to NULL,
    //   the next 5 to share a single tile (tile pool location [17]) and the remaining
    //   4 tiles to each map to unique tile pool locations, [2], [9], [4] and [17]:
    //
    // D3D11_TILED_RESOURCE_COORDINATE TRC[3];
    // D3D11_TILE_REGION_SIZE TRS[3];
    // UINT NumRegions = 3;
    //
    // TRC[0].X = 1;
    // TRC[0].Y = 1;
    // TRC[0].Z = 0;
    // TRC[0].Subresource = 0;
    // TRS[0].bUseBox = TRUE;
    // TRS[0].Width = 2;
    // TRS[0].Height = 3;
    // TRS[0].Depth = 1;
    // TRS[0].NumTiles = TRS[0].Width * TRS[0].Height * TRS[0].Depth;
    //
    // TRC[1].X = 4;
    // TRC[1].Y = 7;
    // TRC[1].Z = 0;
    // TRC[1].Subresource = 0;
    // TRS[1].bUseBox = TRUE;
    // TRS[1].Width = 3;
    // TRS[1].Height = 3;
    // TRS[1].Depth = 1;
    // TRS[1].NumTiles = TRS[1].Width * TRS[1].Height * TRS[1].Depth;
    //
    // TRC[2].X = 20;
    // TRC[2].Y = 30;
    // TRC[2].Z = 0;
    // TRC[2].Subresource = 0;
    // TRS[2].bUseBox = TRUE;
    // TRS[2].Width = 7;
    // TRS[2].Height = 1;
    // TRS[2].Depth = 1;
    // TRS[2].NumTiles = TRS[2].Width * TRS[2].Height * TRS[2].Depth;
    //
    // UINT NumRanges = 8;
    // UINT RangeFlags[8];
    // UINT TilePoolStartOffsets[8];
    // UINT RangeTileCounts[8];
    //
    // RangeFlags[0] = 0;
    // TilePoolStartOffsets[0] = 9;
    // RangeTileCounts[0] = 3;
    //
    // RangeFlags[1] = D3D11_TILE_RANGE_SKIP;
    // TilePoolStartOffsets[1] = 0; // offset is ignored for skip mappings
    // RangeTileCounts[1] = 8;
    //
    // RangeFlags[2] = D3D11_TILE_RANGE_NULL;
    // TilePoolStartOffsets[2] = 0; // offset is ignored for NULL mappings
    // RangeTileCounts[2] = 2;
    //
    // RangeFlags[3] = D3D11_TILE_RANGE_REUSE_SINGLE_TILE;
    // TilePoolStartOffsets[3] = 17;
    // RangeTileCounts[3] = 5;
    //
    // RangeFlags[4] = 0;
    // TilePoolStartOffsets[4] = 2;
    // RangeTileCounts[4] = 1;
    //
    // RangeFlags[5] = 0;
    // TilePoolStartOffsets[5] = 9;
    // RangeTileCounts[5] = 1;
    //
    // RangeFlags[6] = 0;
    // TilePoolStartOffsets[6] = 4;
    // RangeTileCounts[6] = 1;
    //
    // RangeFlags[7] = 0;
    // TilePoolStartOffsets[7] = 17;
    // RangeTileCounts[7] = 1;
    //
    // pDeviceContext2->UpdateTileMappings(pTiledResource,NumRegions,TRC,TRS,pTilePool,NumRanges,RangeFlags,
    //                                     TilePoolStartOffsets,RangeTileCounts,D3D11_TILE_MAPPING_NO_OVERWRITE);
    //
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // CopyTileMappings
    // --------------------------------------------------------------------------------------------------------------------------------
    // CopyTileMappings helps with tasks such as shifting mappings around within/across Tiled Resources, e.g. scrolling tiles.
    // The source and dest region can overlap - the result of the copy in this case is as if the source was saved to a temp and then
    // from there written to the dest, though the implementation may be able to do better.
    //
    // If the dest resource has a different tile pool than the source, any existing mappings in the dest are cleared to NULL
    // and the mappings from the source are applied.  This maintains the rule that a given resource can have mappings into
    // only one tile pool at a time.
    //
    // The Flags field allows D3D11_TILE_MAPPING_NO_OVERWRITE to be specified, which means the caller promises that previously
    //      submitted commands to the device that may still be executing do not reference any of the tile region being updated.
    //      This allows the device to avoid having to flush previously submitted work in order to do the tile mapping
    //      update.  If the application violates this promise by updating tile mappings for locations in Tiled Resources
    //      still being referenced by outstanding commands, undefined rendering behavior results, including the potential
    //      for significant slowdowns on some architectures.  This is like the "no overwrite" concept that exists
    //      elsewhere in the API, except applied to Tile Mapping data structure itself (which in hardware is a page table).
    //      The absence of this flag requires that tile mapping updates specified by this call must be completed before any
    //      subsequent D3D command can proceed.
    //
    // Return Values:
    //
    // Returns S_OK, E_INVALIDARG or E_OUTOFMEMORY.  The latter can happen if the call requires the driver to
    // allocate space for new page table mappings but it runs out of memory.
    //
    // If out of memory occurs when this is called in a commandlist and the commandlist is being executed, the device will be removed.
    // Applications can avoid this situation by only doing calls within commandlists that change existing mappings of Tiled Resources
    // (so drivers will not have to allocate page table memory, only change the mapping).
    //
    // Various other basic conditions, such as invalid flags or passing in non Tiled Resources, result in the call being dropped
    // with E_INVALIDARG.
    //
    // Validation remarks:
    //
    // The dest and the source regions must each entirely fit in their resource or behavior is undefined
    // (debug layer will emit an error).
    //
    
    HRESULT
    ID3D11DeviceContext2::
    CopyTileMappings( _In_ ID3D11Resource* pDestTiledResource,
                      _In_ const D3D11_TILED_RESOURCE_COORDINATE* pDestRegionStartCoordinate,
                      _In_ ID3D11Resource* pSourceTiledResource,
                      _In_ const D3D11_TILED_RESOURCE_COORDINATE* pSourceRegionStartCoordinate,
                      _In_ const D3D11_TILE_REGION_SIZE* pTileRegionSize,
                      _In_ UINT Flags
                        // The only flag that can be specified is:
                        // D3D11_TILE_MAPPING_NO_OVERWRITE (see definition under UpdateTileMappings)
                     );
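
    The following is an illustrative calling sketch (pDeviceContext2, pTiledResource and the coordinate/region values are assumptions, not part of the API definition): scrolling a 4x4 region of tile mappings one tile to the left within mip 0 of a single resource. Overlapping source and dest regions are fine - the result is as if the source region were staged to a temporary first.

    // D3D11_TILED_RESOURCE_COORDINATE Dest;
    // Dest.X = 0; Dest.Y = 0; Dest.Z = 0; Dest.Subresource = 0;
    // D3D11_TILED_RESOURCE_COORDINATE Source;
    // Source.X = 1; Source.Y = 0; Source.Z = 0; Source.Subresource = 0;
    //
    // D3D11_TILE_REGION_SIZE Region;
    // Region.bUseBox = TRUE;
    // Region.Width = 4; Region.Height = 4; Region.Depth = 1;
    // Region.NumTiles = 16; // must equal Width*Height*Depth when bUseBox is TRUE
    //
    // pDeviceContext2->CopyTileMappings(pTiledResource, &Dest, pTiledResource, &Source, &Region, 0);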
    
    

    APIs for retrieving tile mappings from the device are not included (contrary to general D3D convention) because of the high cost and complexity to implement them in a performant way for what appears to be little value. Applications will have to track this state on their own. Tools scenarios are expected to simply track API state from the time the device was created.


    5.9.3.2 Querying Resource Tiling and Support

    // --------------------------------------------------------------------------------------------------------------------------------
    // GetResourceTiling
    // --------------------------------------------------------------------------------------------------------------------------------
    // GetResourceTiling retrieves information about how a Tiled Resource is broken into tiles.
    //
    
    typedef struct D3D11_SUBRESOURCE_TILING
    {
        // Each packed mip is individually reported as 0 for WidthInTiles, HeightInTiles and DepthInTiles.
    
        UINT WidthInTiles;
        UINT HeightInTiles;
        UINT DepthInTiles;
        // Total number of tiles in the subresource is WidthInTiles*HeightInTiles*DepthInTiles
        UINT StartTileIndexInOverallResource;
    };
    
    // D3D11_PACKED_TILE is filled into D3D11_SUBRESOURCE_TILING.StartTileIndexInOverallResource
    // for packed mip levels, signifying that this entire struct is meaningless (WidthInTiles, HeightInTiles,
    // DepthInTiles are also all set to 0).
    // For packed tiles, the description of the packed mips comes from D3D11_PACKED_MIP_DESC instead.
    const UINT D3D11_PACKED_TILE = 0xffffffff;
    
    
    typedef struct D3D11_TILE_SHAPE
    {
        UINT WidthInTexels;
        UINT HeightInTexels;
        UINT DepthInTexels;
        // Texels are equivalent to pixels.  For untyped Buffer resources, a texel is just a byte.
        // For MSAA surfaces the numbers are still in terms of pixels/texels.
        // The values here are independent of the surface dimensions.  Even if the surface is
        // smaller than what would fit in a tile, the full tile dimensions are reported here.
    
    };
    
    typedef struct D3D11_PACKED_MIP_DESC
    {
        UINT NumPackedMips; // How many mips starting from the least detailed mip are packed (either
                            // sharing tiles or using non standard tile layout).  0 if there is no
                            // such packing in the resource.  For array surfaces this value is how many
                            // mips are packed for a given array slice - each array slice repeats the same
                            // packing.
                            // Mipmaps that fill at least one standard shaped tile in all dimensions
                            // are not allowed to be included in the set of packed mips.  Mips with at least one
                            // dimension less than the standard tile shape may or may not be packed,
                            // depending on the IHV.  Once a given mip needs to be packed, all coarser
                            // mips for a given array slice are considered packed as well.
        UINT NumTilesForPackedMips; // If there is no packing this value is meaningless and returns 0.
                                    // Otherwise it returns how many tiles
                                    // are needed to represent the set of packed mips.
                                    // The pixel layout within the packed mips is hardware specific.
                                    // If applications define only partial mappings for the set
                                    // of tiles in packed mip(s), read/write behavior will be
                                    // IHV specific and undefined.
                                    // For arrays this only returns the count of packed mips within
                                    // the subresources for each array slice.
        UINT StartTileIndexInOverallResource; // Offset of the first packed tile for the resource
                                    // in the overall range of tiles.  If NumPackedMips is 0, this
                                    // value is meaningless and returns 0.  A return of 0 for
                                    // StartTileIndexInOverallResource (when NumPackedMips is nonzero) means the entire resource is packed.
                                    // For array surfaces this is the offset for the tiles containing the packed
                                    // mips for the first array slice.
                                    // Packed mips for each array slice in arrayed surfaces are at this offset
                                    // past the beginning of the tiles for each array slice.  (Note the
                                    // number of overall tiles, packed or not, for a given array slice is
                                    // simply the total number of tiles for the resource divided by the
                                    // resource's array size, so it is easy to locate the range of tiles for
                                    // any given array slice, out of which StartTileIndexInOverallResource identifies
                                    // which of those are packed.)
    };
    
    void
    ID3D11Device2::
    GetResourceTiling( _In_ ID3D11Resource* pTiledResource,
                       _Out_opt_ UINT* pNumTilesForEntireResource, // Total number of tiles needed to store the resource
                       _Out_opt_ D3D11_PACKED_MIP_DESC* pPackedMipDesc, // Mip packing details
                       _Out_opt_ D3D11_TILE_SHAPE* pTileShape, // How pixels fit in tiles, independent of surface dimensions,
                                                               // not including packed mip(s).  If the entire surface is packed,
                                                               // this parameter is meaningless since there is no defined layout
                                                               // for packed mips.  In this case the returned fields are set to 0.
                       _Inout_opt_ UINT* pNumSubresourceTilings, // IN: how many subresources to query tilings for,
                                                                // OUT: returns how many retrieved (clamped to what's available)
                       _In_ UINT FirstSubresourceTilingToGet, // ignored if *pNumSubresourceTilings is 0,
                       _Out_writes_(*pNumSubresourceTilings) D3D11_SUBRESOURCE_TILING* pSubresourceTilings, // Subresources that
                                                              // are part of packed mips return 0 for all of the fields in
                                                              // the corresponding output, except StartTileIndexInOverallResource which is
                                                              // set to D3D11_PACKED_TILE (0xffffffff) - basically indicating the whole
                                                              // struct is meaningless for this case and pPackedMipDesc applies.
                      );
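
    The following is an illustrative usage sketch (pDevice2 and pTiledTexture are assumptions): querying the overall tile count, the mip packing description and the per-subresource tilings for a tiled Texture2D.

    // UINT NumTilesForEntireResource;
    // D3D11_PACKED_MIP_DESC PackedMipDesc;
    // D3D11_TILE_SHAPE TileShape;
    // UINT NumSubresourceTilings = 16; // in: room available; out: how many were retrieved
    // D3D11_SUBRESOURCE_TILING SubresourceTilings[16];
    //
    // pDevice2->GetResourceTiling(pTiledTexture, &NumTilesForEntireResource, &PackedMipDesc,
    //                             &TileShape, &NumSubresourceTilings, 0, SubresourceTilings);
    //
    // // Subresources in the packed mip range report D3D11_PACKED_TILE in
    // // StartTileIndexInOverallResource; PackedMipDesc describes those mips instead.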
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // CheckMultisampleQualityLevels1
    // --------------------------------------------------------------------------------------------------------------------------------
    // CheckMultisampleQualityLevels1 is a variant of the existing CheckMultisampleQualityLevels API that adds a flags field that
    // allows the caller to indicate the query is for a tiled resource.  This allows drivers to report multisample quality levels
    // for tiled resources differently than non-Tiled resources.
    //
    // As with non-tiled Resources, when Multisampling is supported/required for a given format, applications are guaranteed to
    // be able to use the standard or center multisample patterns instead of using one of the driver quality levels.
    //
    typedef enum D3D11_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAGS
    {
        D3D11_CHECK_MULTISAMPLE_QUALITY_LEVELS_TILED_RESOURCE = 0x00000001,
    };
    
    HRESULT
    ID3D11Device2::
    CheckMultisampleQualityLevels1(
                _In_  DXGI_FORMAT Format,
                _In_  UINT SampleCount,
                _In_  UINT Flags, // D3D11_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAGS
                _Out_  UINT *pNumQualityLevels);
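
    An illustrative usage sketch (pDevice2 is an assumption): querying 4x multisample quality levels for a tiled B8G8R8A8_UNORM resource. A result of 0 quality levels means the format/sample-count combination is unsupported for Tiled Resources.

    // UINT NumQualityLevels = 0;
    // HRESULT hr = pDevice2->CheckMultisampleQualityLevels1(
    //     DXGI_FORMAT_B8G8R8A8_UNORM, 4,
    //     D3D11_CHECK_MULTISAMPLE_QUALITY_LEVELS_TILED_RESOURCE,
    //     &NumQualityLevels);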
    
    

    5.9.3.3 Copying Tiled Data

    As mentioned, existing methods in D3D for moving data around work with Tiled Resources just as if they are not Tiled, except that writes to unmapped areas are dropped and reads from unmapped areas produce 0. If a copy involves writing to the same memory location multiple times because multiple locations in the destination resource are mapped to the same tile memory, the resulting writes to multi-mapped tiles are nondeterministic/nonrepeatable - accesses happen in whatever order the hardware happens to execute the copy.

    This section describes the following additional methods of copying:
    (a) between tiles in a Tiled Resource (at 64KB tile granularity) and a Buffer in GPU memory (or staging resource), in either direction - CopyTiles()
    (b) from application provided memory to tiles in a Tiled Resource - UpdateTiles()
    These methods swizzle/deswizzle as needed, and allow a D3D11_TILE_COPY_NO_OVERWRITE flag when the caller promises the destination memory is not referenced by GPU work that is in flight.

    The tiles involved in the copy cannot include tiles containing packed mipmaps or results are undefined. To transfer data to/from mipmaps that the hardware packs into one tile, the standard (non-tile specific) Copy/Update APIs (or GenerateMips for the whole mip chain) must be used.


    5.9.3.3.1 Note on GenerateMips()

    Using GenerateMips() on a resource with partially mapped tiles produces results that simply follow the rules for reading and writing NULL-mapped tiles, applied to whatever algorithm the hardware/driver happens to use for GenerateMips(). So it is not particularly useful for an application to do this unless the areas with NULL mappings (and their effect on other mips during the generation phase) have no consequence on the parts of the surface the application cares about.

    Copying tile data from a staging surface or from application memory would be the way to upload tiles that have been streamed off disk, for example. A variation when streaming off disk is uploading some sort of compressed data to GPU memory and then decoding on the GPU. The decode target could be a buffer resource in GPU memory, from which CopyTiles() then copies to the actual Tiled Resource. This copy step allows the GPU to swizzle when the swizzle pattern is not known to the application. Swizzling is not needed if the Tiled Resource itself is a Buffer resource (as opposed to a Texture).

    The memory layout of the tiles in the non-tiled Buffer resource side of the copy is simply linear in memory within 64KB tiles, which the hardware/driver would swizzle/deswizzle per tile as appropriate when transferring to/from a Tiled Resource. For MSAA surfaces, each pixel's samples are traversed in sample-index order before moving to the next pixel. For tiles that are partially filled on the right side (for a surface that has a width not a multiple of tile width in pixels), the pitch/stride to move down a row is the full size in bytes of the number of pixels that would fit across the tile if the tile was full. So there can be a gap between each row of pixels in memory. For specification simplicity, mipmaps smaller than a tile are not packed together in the linear layout. This seems to be a waste of memory space, but as mentioned copying to mips that the hardware packs together is not allowed via CopyTiles() or UpdateTiles(). The application can just use generic UpdateSubresource*() or CopySubresource*() APIs to copy small mips individually, though in the case of CopySubresource*() that means the linear memory has to be the same dimension as the Tiled Resource - CopySubresource*() can't copy from a Buffer resource to a Texture2D for instance.
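
    To make the layout concrete, here is a minimal addressing sketch under assumed values (a 2D format at 4 bytes per texel, which gives a standard 128x128-texel tile; the variable names are illustrative): the row pitch within a tile is always the full tile width in bytes, even when the surface only partially fills the tile on its right edge.

    // const UINT BytesPerTexel = 4;
    // const UINT TileWidthInTexels = 128;  // from D3D11_TILE_SHAPE for this format
    // const UINT RowPitchInTile = TileWidthInTexels * BytesPerTexel; // 512 bytes
    //
    // // Byte offset of texel (x,y) within tile index t of the linear buffer:
    // UINT64 Offset = (UINT64)t * 65536 + (UINT64)y * RowPitchInTile + (UINT64)x * BytesPerTexel;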

    If a hardware standard swizzle is defined, flags could be added to indicate that the data in the Buffer is to be interpreted in that format (no swizzle necessary on transfer), though alternative approaches to uploading data may also make sense in that case, such as allowing applications direct access to Tile Pool memory.

    Copying operations can be done on an immediate or deferred context.

    typedef enum D3D11_TILE_COPY_FLAGS
    {
        D3D11_TILE_COPY_NO_OVERWRITE = 0x00000001,
                   //   D3D11_TILE_COPY_NO_OVERWRITE indicates that the application promises
                   //   the GPU is not currently referencing any of the
                   //   portions of destination memory being written.
    
        D3D11_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE = 0x00000002,
                   //   D3D11_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE means copy tile data from the
                   //   specified buffer location, reading tiles sequentially,
                   //   to the specified tile region (in x,y,z order if the region is a box),
                   //   swizzling to optimal hardware memory layout as needed.
                   //   In this case the source data is pBuffer and the destination is pTiledResource
    
        D3D11_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER = 0x00000004,
                   //   D3D11_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER means copy tile data from the
                   //   tile region, reading tiles sequentially (in x,y,z order if the region is a box),
                   //   to the specified buffer location, deswizzling to linear memory layout as needed.
                   //   In this case the source data is pTiledResource and the destination is pBuffer
    };
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // CopyTiles
    // --------------------------------------------------------------------------------------------------------------------------------
    // Copy from buffer to tiled resource or vice versa.
    
    void
    ID3D11DeviceContext2::
    CopyTiles( _In_ ID3D11Resource* pTiledResource,
               _In_ const D3D11_TILED_RESOURCE_COORDINATE* pTileRegionStartCoordinate,
               _In_ const D3D11_TILE_REGION_SIZE* pTileRegionSize,
               _In_ ID3D11Buffer* pBuffer, // Default, dynamic or staging buffer
               _In_ UINT64 BufferStartOffsetInBytes,
               _In_ UINT Flags // D3D11_TILE_COPY_FLAGS
             );
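
    An illustrative usage sketch (the variables are assumptions): reading one 64KB tile back from a tiled resource into a staging buffer for inspection on the CPU.

    // D3D11_TILED_RESOURCE_COORDINATE Coord;
    // Coord.X = 2; Coord.Y = 3; Coord.Z = 0; Coord.Subresource = 0;
    //
    // D3D11_TILE_REGION_SIZE Region;
    // Region.bUseBox = FALSE; // NumTiles traverses linearly from Coord
    // Region.NumTiles = 1;
    //
    // pDeviceContext2->CopyTiles(pTiledResource, &Coord, &Region, pStagingBuffer, 0,
    //                            D3D11_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER);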
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // UpdateTiles
    // --------------------------------------------------------------------------------------------------------------------------------
    // Copy from application memory to tiled resource.
    
    void
    ID3D11DeviceContext2::
    UpdateTiles( _In_ ID3D11Resource* pDestTiledResource,
                 _In_ const D3D11_TILED_RESOURCE_COORDINATE* pDestTileRegionStartCoordinate,
                 _In_ const D3D11_TILE_REGION_SIZE* pDestTileRegionSize,
                 _In_ const void* pSourceTileData, // caller memory
                 _In_ UINT Flags // D3D11_TILE_COPY_FLAGS:
                      // Valid options: D3D11_TILE_COPY_NO_OVERWRITE
                      //                (the other flags aren't meaningful here, though
                      //                by definition the flag D3D11_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE
                      //                is basically what UpdateTiles does, sourcing from application memory.)
               );
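
    An illustrative usage sketch (the variables are assumptions): uploading one tile's worth of data (64KB, linear-within-tile layout as described earlier) from application memory to tile (0,0) of subresource 0.

    // D3D11_TILED_RESOURCE_COORDINATE Coord;
    // Coord.X = 0; Coord.Y = 0; Coord.Z = 0; Coord.Subresource = 0;
    //
    // D3D11_TILE_REGION_SIZE Region;
    // Region.bUseBox = FALSE;
    // Region.NumTiles = 1;
    //
    // BYTE TileData[64 * 1024]; // filled by the application, e.g. streamed from disk
    // pDeviceContext2->UpdateTiles(pTiledResource, &Coord, &Region, TileData, 0);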
    

    5.9.3.4 Resize Tile Pool

    // --------------------------------------------------------------------------------------------------------------------------------
    // ResizeTilePool
    // --------------------------------------------------------------------------------------------------------------------------------
    // Resize a Tile Pool.  See Resizing Tile Pools(5.9.2.2.2) for discussion, including specifics about what
    // shrinking means.
    //
    // New Tile Pool size must be a multiple of 64KB (or 0) otherwise the call returns E_INVALIDARG.
    // On out of memory the call returns E_OUTOFMEMORY.  For either of these failures, the existing Tile Pool remains unchanged,
    // including existing mappings.  DXGI_ERROR_DEVICE_REMOVED is the other possible error code.  S_OK for success.
    //
    
    HRESULT
    ID3D11DeviceContext2::
    ResizeTilePool( _In_ ID3D11Buffer* pTilePool,
                    _In_ UINT64 NewSizeInBytes );
    
    

    5.9.3.5 Tiled Resource Barrier

    // --------------------------------------------------------------------------------------------------------------------------------
    // TiledResourceBarrier
    // --------------------------------------------------------------------------------------------------------------------------------
    // With Tiled Resources applications have a lot of freedom to reuse tiles in different resources.  Sometimes it may not be clear
    // to a device/driver, without unreasonable tracking overhead, that some memory in a tile pool that was just written to is
    // now being used for reading (so caches may have to be flushed or a bubble might have to be introduced in the pipeline depending
    // on the timing in order to generate correct results).
    //
    // As an example, an application may copy to some tiles in a Tile Pool via one Tiled Resource but then read from the same
    // tiles using a different Tiled Resource.  This is different from using the same resource object first as a destination for
    // copying data and then as a source via ShaderResourceView read (which drivers can already tell must be kept in order).
    //
    // In full detail, the requirement of an application is as follows: When an application transitions from accessing (reading or writing)
    // some location in a Tile Pool with one subresource (e.g. mip slice) to accessing the same memory (read or write) via another subresource
    // or different Tiled Resource, in a way that would not be obvious to drivers (because they do not need to bother keeping track of where
    // tiles are being shared), the application must call TiledResourceBarrier after the first access to the resource and before the second
    // different method of access.  Calling TiledResourceBarrier isn't required if both accesses are reads.  The parameters are the
    // TiledResource that was accessed before the Barrier and the TiledResource that will be accessed after the Barrier using the same
    // Tile Pool memory.  If the resources and subresources involved are the same, the API doesn't need to be called, as drivers track
    // hazards at the subresource level on their own, cheaply.
    //
    // The Barrier call informs the driver that operations issued to the resource before the call must complete before any accesses that
    // occur after the call via a different Tiled Resource that shares the same memory.
    //
    // Either or both of the parameters (before or after the barrier) can be NULL.  NULL before the barrier means
    // all tiled resource accesses before the barrier that have mappings into the Tile Pool that the resource after the barrier maps to
    // must complete before the resource specified after the barrier can be referenced by the GPU. NULL after the barrier means
    // that any tiled resource accesses after the barrier with mappings to the Tile Pool that the resource before the barrier maps
    // to can only be executed by the GPU after accesses to the tiled resource before the barrier are finished.  Both NULL means all
    // previous tiled resource accesses are complete before any subsequent tiled resource access may proceed (for all Tile Pools).
    //
    // Either a view pointer, a resource or NULL can be passed for each parameter.  Views are allowed both for
    // convenience and to allow scoping of the barrier effect to a relevant portion of a resource.
    //
    // Rendering commands that the driver/hardware can tell are completely independent of the tiled resources identified in this
    // call are unconstrained in their order of execution with respect to accesses to the identified tiled resources and the barrier.
    // If exploiting reordering could produce visible side effects (given appropriate barriers were specified)
    // it is an invalid reordering by the system/hardware.
    //
    
    void
    ID3D11DeviceContext2::
    TiledResourceBarrier(
        _In_opt_ ID3D11DeviceChild* pTiledResourceOrViewAccessBeforeBarrier,
        _In_opt_ ID3D11DeviceChild* pTiledResourceOrViewAccessAfterBarrier
     );
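
    An illustrative usage sketch (the variables are assumptions): tiles are written via one Tiled Resource and then read via a different Tiled Resource mapped to the same Tile Pool memory, so a barrier is required in between.

    // ... render to pTiledResourceA (writes land in shared Tile Pool tiles) ...
    //
    // pDeviceContext2->TiledResourceBarrier(pTiledResourceA, pTiledResourceB);
    //
    // ... draw calls reading pTiledResourceB via SRV ...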
    

    5.9.4 Pipeline Access to Tiled Resources

    Tiled Resources can be used in Shader Resource Views, Render Target Views, Depth Stencil Views and Unordered Access Views, as well as some bindpoints where Views aren't used, such as Vertex Buffer bindings. See the list of supported bindings earlier. Copy* operations also work on Tiled Resources.

    If multiple tile coordinates in one or more views are bound to the same memory location, reads and writes from different paths to the same memory occur in a nondeterministic/nonrepeatable order.

    If all tiles behind a memory access footprint from a shader are mapped to unique tiles, behavior on all implementations is identical to a non-tiled surface with the same memory contents.


    5.9.4.1 SRV Behavior with Non-Mapped Tiles

    Behavior for SRV reads that involve non-mapped tiles depends on the level of hardware support - see read behavior in Tiled Resources Feature Tiers(5.9.7) for a breakdown of requirements. The following summarizes the ideal behavior (which Tier 2 requires).

    Consider a texture filter operation that reads from a set of texels in an SRV. Texels that fall on non-mapped tiles contribute 0 in all non-missing components of the format, and the default for missing components(19.1.3.3), into the overall filter operation alongside contributions from mapped texels. The texels are all weighted and combined together independent of whether the data came from mapped or non-mapped tiles.

    Some first generation Tier 2 level hardware does not meet this spec requirement and returns the 0-with-defaults value described above as the overall filter result if ANY texels (with nonzero weight) fall on non-mapped tiles. No other hardware will be allowed to miss the requirement to include all (nonzero weight) texels in the filter.

    It was considered to have an option to automatically fall back to a coarser mip in some fashion when a filter footprint hits missing tiles, either at the texel level or just for the entire fetch. However there didn't seem to be a clear advantage for the cost versus relying on applications figuring out how to avoid or deal with missing tiles on their own.


    5.9.4.2 UAV Behavior with Non-Mapped Tiles

    Behavior of UAV reads and writes depends on the level of hardware support. See overall read and write behavior for Tiled Resources Feature Tiers(5.9.7) for a breakdown of requirements.

    Ideal behavior:

    Shader operations that read from a non-mapped tile in a UAV return 0 in all non-missing components of the format, and the default for missing components(19.1.3.3).

    Shader operations that attempt to write to a non-mapped tile cause nothing to be written to the non-mapped area (while writes to mapped areas proceed). This ideal definition for write handling is not required by Tier 2 - writes to non-mapped tiles may end up in a cache that subsequent reads could pick up.


    5.9.4.3 Rasterizer Behavior with Non-Mapped Tiles


    5.9.4.3.1 DepthStencilView

    Behavior of DSV reads and writes depends on the level of hardware support. See overall read and write behavior for Tiled Resources Feature Tiers(5.9.7) for a breakdown of requirements.

    Ideal behavior:

    If a tile is not mapped in the DepthStencilView, the return value from reading depth is 0, which is then fed into whatever operation(s) are configured for the depth read value. Writes to the missing depth tile are dropped. This ideal definition for write handling is not required by Tier 2 - writes to non-mapped tiles may end up in a cache that subsequent reads could pick up.


    5.9.4.3.2 RenderTargetView

    Behavior of RTV reads and writes depends on the level of hardware support. See overall read and write behavior for Tiled Resources Feature Tiers(5.9.7) for a breakdown of requirements.

    On all implementations it is valid for different RTVs (and DSV) bound simultaneously to have different areas mapped vs non-mapped and to have different sized surface formats (which means different tile shapes).

    Ideal behavior:

    Reads from RenderTargetViews return 0 in all non-missing components of the format, and the default for missing components(19.1.3.3). Writes to RenderTargetViews are dropped. This ideal definition for write handling is not required - writes to non-mapped tiles may end up in a cache that subsequent reads could pick up.


    5.9.4.4 Tile Access Limitations With Duplicate Mappings


    5.9.4.4.1 Copying Tiled Resources With Overlapping Source and Dest

    If tiles in the source and dest areas of a Copy* operation have duplicated mappings that would have overlapped even if both resources were not Tiled Resources, and the Copy* call supports overlapping copies, the copy behaves correctly (as if the source is copied to a temp before going to the dest). However if the overlap is not obvious (e.g. the source and dest resources are different but share mappings, or mappings are duplicated over a given surface), the results of the copy operation on the shared tiles are undefined.


    5.9.4.4.2 Copying To Tiled Resource with Duplicated Tiles in Dest Area

    Copying to a Tiled Resource with duplicated tiles in the destination area produces undefined results in these tiles unless the data itself is identical - different destination locations mapped to the same tile may be written in different orders.


    5.9.4.4.3 UAV Accesses to Duplicate Tiles Mappings

    Suppose an Unordered Access View on a Tiled Resource has duplicate tile mappings in its area or with other resources bound to the pipeline. Ordering of accesses to these duplicated tiles is undefined if performed by different threads, just as any ordering of memory access to UAVs in general is unordered.


    5.9.4.4.4 Rendering After Tile Mapping Changes Or Content Updates from Outside Mappings

    If a Tiled Resource's Tile Mappings have changed, or content in mapped Tile Pool tiles has changed via another Tiled Resource's mappings, and the Tiled Resource is going to be rendered via RenderTargetView or DepthStencilView, the application must Clear (using the fixed function Clear APIs) or fully copy over (using Copy*/Update* APIs) the tiles that have changed within the area being rendered (mapped or not). Failure of an application to clear/copy in these cases leaves hardware optimization structures for the given RenderTargetView or DepthStencilView stale, and will result in garbage rendering results on some hardware and inconsistency across different hardware. These hidden optimization data structures used by hardware may be local to individual mappings, not visible to other mappings to the same memory.

    The ClearView API/DDI supports clearing RenderTargetViews with rects, and for hardware that supports Tiled Resources, ClearView must also support clearing of DepthStencilViews with rects, for depth only surfaces (without stencil). This allows applications to Clear only the necessary area of a surface.

    If an application needs to preserve existing memory contents of areas in a Tiled Resource where mappings have changed, it has to work around the Clear requirement, unfortunately. The application can accomplish this by first saving the contents where Tile mappings have changed (by copying them to a temporary surface, for example using CopyTiles()), issuing the required Clear, and then copying the contents back. While this accomplishes the task of preserving surface contents for incremental rendering, the downside is that subsequent rendering performance on the surface may suffer because rendering optimizations may be lost.
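
    A minimal sketch of this save/clear/restore workaround (the variable names are assumptions):

    // // 1. Save the tiles whose mappings changed to a temporary buffer.
    // pDeviceContext2->CopyTiles(pTiledRT, &RegionStart, &RegionSize, pTempBuffer, 0,
    //                            D3D11_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER);
    //
    // // 2. Issue the required Clear, scoped by rect to the affected area.
    // FLOAT ClearColor[4] = { 0, 0, 0, 0 };
    // pDeviceContext2->ClearView(pRTV, ClearColor, &AffectedRect, 1);
    //
    // // 3. Restore the saved contents.
    // pDeviceContext2->CopyTiles(pTiledRT, &RegionStart, &RegionSize, pTempBuffer, 0,
    //                            D3D11_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE);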

    If a tile is mapped into multiple Tiled Resources at the same time, and tile contents are manipulated by any means (render, copy, etc.) via one of the Tiled Resources, then if the same tile is to be rendered via any other Tiled Resource, the tile must be Cleared first as above.


    5.9.4.4.5 Rendering To Tiles Shared Outside Render Area

    Suppose an area in a Tiled Resource is being rendered to and the Tile Pool tiles referenced by the render area are also mapped to from outside the render area (including via other Tiled Resources, at the same time or not). Data rendered to these tiles is not guaranteed to appear correctly when viewed through the other mappings, even though the underlying memory layout is compatible. This is due to optimization data structures some hardware uses that can be local to individual mappings for renderable surfaces, not visible to other mappings to the same memory location. This restriction can be worked around by copying from the rendered mapping to all the other mappings to the same memory that might be accessed (or clearing that memory or copying other data to it if the old contents are no longer needed). While this seems redundant, it makes all other mappings to the same memory correctly understand how to access its contents, and at least the memory savings of having only a single physical memory backing remains intact. Also, note that when switching between using different Tiled Resources that share mappings (unless only reading), the TiledResourceBarrier API must be called in between.


    5.9.4.4.6 Rendering To Tiles Shared Within Render Area

    If an area in a Tiled Resource is being rendered to, and within the render area multiple tiles are mapped to the same Tile Pool location, rendering results are undefined on those tiles.


    5.9.4.4.7 Data Compatibility Across Tiled Resources Sharing Tiles

    Suppose multiple Tiled Resources have mappings to the same Tile Pool locations and each resource is used to access the same data. This is only valid if the other rules about avoiding problems with hardware optimization structures are followed, appropriate calls to TiledResourceBarrier are made, and the Tiled Resources are compatible with each other. The latter is described here (in terms of what it means for Tiled Resources sharing tiles to be incompatible). The conditions for incompatibility when accessing the same data across duplicate tile mappings are: different surface dimensions or formats, or differences in the presence of the RenderTarget or DepthStencil BindFlags on the Resources. Writing to the memory with one type of mapping produces undefined results if subsequently reading or rendering via a mapping from an incompatible Resource. If the other Resource sharing mappings will first be initialized with new data (recycling the memory for a different purpose), that is fine, since data is not bleeding across incompatible interpretations; however the TiledResourceBarrier API must be called when switching between accessing incompatible mappings like this.

    If the RenderTarget or DepthStencil BindFlag is not set on any of the resources sharing mappings with each other, there are far fewer restrictions: As long as the format and surface types (e.g. Texture2D) are the same, tiles can be shared. Some cases of different formats are compatible, such as BC* surfaces and the equivalent sized uncompressed 32 bit or 16 bit per component format, like BC6H and R32G32B32A32. Many 32 bit per element formats can be aliased with R32_* as well (R10G10B10A2_*, R8G8B8A8_*, B8G8R8A8_*, B8G8R8X8_*, R16G16_*) - this has always been allowed for non Tiled Resources.

    Sharing between packed and non-packed tiles is fine if the formats are compatible and the tiles are filled with solid color.

    Finally, if nothing is common about the Resources sharing tile mappings except that none have RenderTarget/DepthStencil BindFlags, then only memory filled with 0 can be shared safely - it will appear as whatever 0 decodes to for the definition of the given Resource format (typically just 0).


    5.9.4.5 Tiled Resources Texture Sampling Features


    5.9.4.5.1 Overview

    The texture sampling features described here require Tier(5.9.7) 2 level of Tiled Resources support.


    5.9.4.5.2 Shader Feedback About Mapped Areas

    Any instruction that reads and/or writes to a Tiled Resource causes status information to be recorded. This is exposed as an optional extra return value, written to a 32-bit temp register, on every resource access instruction. The contents of the return value are opaque - direct reading by the shader program is disallowed. However dedicated instruction(s) (initially only one) allow status information to be extracted.


    5.9.4.5.3 Fully Mapped Check

    The check_access_mapped(22.4.26) instruction interprets the status return from a memory access and indicates whether all data being accessed was mapped in the resource - true (0xFFFFFFFF) or false (0x00000000).

    During filter operations, sometimes the weight of a given texel ends up being 0.0. An example is a linear sample with texture coordinates that fall directly on a texel center: 3 other texels (which ones they are can vary by hardware) contribute to the filter - but with 0 weight. These 0 weight texels do not contribute to the filter result at all, so if they happen to fall on NULL tiles they don't count as an unmapped access. Note the same guarantee applies for texture filters that include multiple mip levels - if the texels on one of the mipmaps is not mapped but the weight on those texels is 0, those texels don't count as an unmapped access.

    When sampling from a format that has fewer than 4 components (such as DXGI_FORMAT_R8_UNORM), any texels that fall on NULL tiles result in a NULL mapped access being reported, regardless of which component(s) the shader actually looks at in the result. For example, reading from R8_UNORM and masking the read result in the shader with .gba/.yzw wouldn't appear to need to read the texture at all, but if the texel address is a NULL mapped tile it still counts as a NULL map access.

    The shader can check the status and pursue any desired course of action on failure. For example logging 'misses' (say via UAV write) and/or issuing another read clamped to a coarser LOD known to be mapped.  It may be useful for an application to track successful accesses as well in order to get a sense of what portion of the mapped set of tiles got accessed.

    One complication for logging is that there is no mechanism for reporting the exact set of tiles that would have been accessed. The application can make conservative guesses based on knowing the coordinates it used for access, as well as by using the lod instruction, which returns the LOD the hardware calculates.

    Another complication is that lots of accesses will be to the same tiles, so there will be a lot of redundant logging and possibly contention on memory.  It could be convenient if the hardware could be given the option to not bother to report tile accesses if they were reported elsewhere before.  Perhaps the state of such tracking could be reset from the API (likely at frame boundaries).


    5.9.4.5.4 Per-sample MinLOD Clamp

    To help shaders avoid areas in mipmapped Tiled Resources that are known to be non-mapped, most shader instructions that involve using a Sampler (filtering) have a new mode that allows the shader to pass an additional float32 MinLOD clamp parameter to the texture sample. This value is in the View's mipmap number space, as opposed to the underlying resource.

    The hardware performs max(fShaderMinLODClamp,fComputedLOD) in the same place in the LOD calculation where the per-Resource MinLOD clamp occurs (which is also a max()).

    If the result of applying the Per-sample LOD clamp and any other LOD clamps defined in the sampler is an empty set, the result is the same out of bounds access result as the per-Resource minLOD clamp: 0 for components in the surface format and defaults for missing components.

    The lod instruction (which predates the per-sample minLOD clamp described here) returns both a clamped and unclamped LOD. The clamped LOD return from this lod instruction reflects all clamping including the per-resource clamp, but not a per-sample clamp. Per-sample clamp is controlled/known by the shader anyway, so the shader author can manually apply that clamp to the lod instruction's return value if desired.


    5.9.4.5.5 Shader Instructions

    The following shader instructions include combinations of feedback and/or clamp in addition to their basic operation, followed by instructions that examine the feedback return. If the clamp is used, it is an additional scalar float32 register or immediate operand. If feedback is requested, it comes out in an additional 32 bit scalar register operand that needs to be fed into instruction(s) that interpret feedback.

    These instructions can be used on Tiled or non-Tiled Resources for all applicable resource dimensions (Buffer, Texture1D/2D/3D). Non-Tiled Resources always appear to be fully mapped.

    The suffix _s indicates mapping status, and _cl indicates LOD clamp.

    The following instructions have a mapping status return option [_s] (but no clamp option):

    The following instructions have both mapping status [_s] and clamp [_cl] options:

    The following instruction examines the status return from any of the above instructions: check_access_mapped(22.4.26).

    Note there is no feedback for memory write instructions like store_uav_*. This could be added if needed, but at this time of design some hardware does not support it.


    5.9.4.5.6 Min/Max Reduction Filtering

    Applications may choose to manage their own data structures that inform them of what the mappings look like for a Tiled Resource. An example would be a surface that contains a texel holding information about every tile in a Tiled Resource. One might store the first LOD that is mapped at a given tile location. By carefully sampling this data structure in a similar way that the Tiled Resource is intended to be sampled, one might discover the minimum LOD that is fully mapped for an entire texture filter footprint. To help make this process easier, a new general purpose sampler mode is introduced: min/max filtering.

    Note there is disagreement among IHVs on the utility of min/max filtering for LOD tracking. It hasn't been proven. However, the feature may be useful for other purposes, such as perhaps the filtering of depth surfaces.

    Min/Max Reduction filtering is a mode on Samplers that fetches the same set of texels that a normal texture filter would fetch, but instead of blending the values to produce an answer, it returns the min() or max() of the texels fetched, on a per-component basis (e.g. the min of all the R values, separately from the min of all the G values etc.)

    The min/max operations follow D3D arithmetic precision rules. The order of comparisons does not matter.

    During filter operations that are not min/max, sometimes the weight of a given texel ends up being 0.0. An example is a linear sample with texture coordinates that fall directly on a texel center - 3 other texels (which ones they are may vary by hardware) contribute to the filter but with 0 weight. For any of these texels that would be 0 weight on a non-min/max filter, if the filter is min/max these texels still do not contribute to the result (and the weights do not otherwise affect the min/max filter operation).

    The full list of filter modes is shown in the D3D11_FILTER enum in the Sampler State(7.18.3) section - note the modes with MINIMUM and MAXIMUM in the name.
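
    An illustrative creation sketch (pDevice is an assumption): a sampler whose filter returns the per-component minimum of the texels a linear filter would otherwise have blended.

    // D3D11_SAMPLER_DESC Desc = {};
    // Desc.Filter = D3D11_FILTER_MINIMUM_MIN_MAG_MIP_LINEAR;
    // Desc.AddressU = Desc.AddressV = Desc.AddressW = D3D11_TEXTURE_ADDRESS_CLAMP;
    // Desc.MaxLOD = D3D11_FLOAT32_MAX;
    //
    // ID3D11SamplerState* pMinSampler = NULL;
    // HRESULT hr = pDevice->CreateSamplerState(&Desc, &pMinSampler);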

    Support for this feature depends on Tier(5.9.7) 2 support for Tiled Resources.


    5.9.4.6 HLSL Tiled Resources Exposure

    New HLSL syntax is required to support tiled resources in Shader Model 5.0 (allowed only on devices with Tiled Resources support). Each relevant HLSL intrinsic method for tiled resources (see the table below) accepts either one (feedback) or two (clamp and feedback in this order) additional optional parameters. For example, the Sample method is:

    Sample(sampler, location [, offset [, clamp [, feedback] ] ]).

    The offset, clamp and feedback parameters are optional. Programmers have to specify all optional parameters up to the one they need, which is consistent with the C++ rules for default function arguments. For example, if the feedback status is needed, both offset and clamp parameters need to be explicitly supplied to Sample, even though they may not be logically needed.

    The clamp parameter is a scalar float value. The literal value clamp=0.0f indicates that the clamp operation is not performed.

    The feedback parameter is a uint variable that can be supplied to the memory-access querying intrinsic CheckAccessFullyMapped. Programmers must not modify or interpret the value of the feedback parameter; however, the compiler does not provide any advanced analysis and diagnostics to detect this.

    There is one HLSL intrinsic to query the feedback status:

    bool CheckAccessFullyMapped(in uint FeedbackVar);

    CheckAccessFullyMapped interprets the value of FeedbackVar and returns true if all data being accessed was mapped in the resource; otherwise, CheckAccessFullyMapped returns false.

    If either clamp or feedback parameter is present, the compiler emits a variant of the basic instruction. For example, Sample of a tiled resource generates sample_cl_s instruction. If neither clamp nor feedback is specified, the compiler emits the basic instruction, so that there is no change from the current behavior. The clamp value of 0.0f indicates that no clamp is performed; thus, the driver compiler can further tailor the instruction to the target hardware. If feedback is a NULL register in an instruction, the feedback is unused; thus, the driver compiler can further tailor the instruction to the target architecture.

    If the HLSL compiler infers that clamp is 0.0f and feedback is unused, the compiler emits the corresponding basic instruction (e.g., sample rather than sample_cl_s).

    If a tiled resource access consists of several constituent byte code instructions, e.g., for structured resources, the compiler aggregates individual feedback values via the OR operation to produce the final feedback value. Therefore, programmers see a single feedback value for such a complex access.
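
    An illustrative HLSL sketch (the resource, sampler and variable names are assumptions): sampling with an explicit offset, a per-sample MinLOD clamp and a feedback status, then falling back to a coarser LOD the application knows is mapped when the access was not fully mapped.

    Texture2D<float4> Tex;
    SamplerState Samp;

    float4 SampleWithFallback(float2 uv, float minLodClamp, float coarseMappedLod)
    {
        uint Status;
        // offset and clamp must be supplied explicitly to reach the feedback parameter
        float4 Color = Tex.Sample(Samp, uv, int2(0, 0), minLodClamp, Status);
        if (!CheckAccessFullyMapped(Status))
        {
            Color = Tex.SampleLevel(Samp, uv, coarseMappedLod);
        }
        return Color;
    }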

    This is the summary table of HLSL intrinsic methods changed to support feedback and/or clamp. These all work on tiled and non-tiled resources of all dimensions. Non-tiled resources always appear to be fully mapped.

    HLSL Objects                          Intrinsic methods with feedback option
                                          ((*) - also has clamp option)

    [RW]Texture2D                         Gather
    [RW]Texture2DArray                    GatherRed
    TextureCUBE                           GatherGreen
    TextureCUBEArray                      GatherBlue
                                          GatherAlpha
                                          GatherCmp
                                          GatherCmpRed
                                          GatherCmpGreen
                                          GatherCmpBlue
                                          GatherCmpAlpha

    [RW]Texture1D                         Sample*
    [RW]Texture1DArray                    SampleBias*
    [RW]Texture2D                         SampleCmp*
    [RW]Texture2DArray                    SampleCmpLevelZero
    [RW]Texture3D                         SampleGrad*
    TextureCUBE                           SampleLevel
    TextureCUBEArray

    [RW]Texture1D                         Load
    [RW]Texture1DArray
    [RW]Texture2D
    Texture2DMS
    [RW]Texture2DArray
    Texture2DArrayMS
    [RW]Texture3D
    [RW]Buffer
    [RW]ByteAddressBuffer
    [RW]StructuredBuffer

    5.9.5 Tiled Resource DDIs


    5.9.5.1 Resource Creation DDI: D3D11DDIARG_CREATERESOURCE

    This existing DDI includes new options on the MiscFlags parameter:

    D3DWDDM1_3DDI_RESOURCE_MISC_TILED :
             Indicates the resource is tiled. Constraints on when
             this flag can be used are described elsewhere.
    
    D3DWDDM1_3DDI_RESOURCE_MISC_TILE_POOL :
             Indicates the resource is a tile pool.  Must be a Buffer,
             with usage DEFAULT.  Full constraints described elsewhere.
    

    5.9.5.2 Texture Filter Descriptor: D3D10_DDI_FILTER

    This existing enum for filter types has new entries for min/max filtering.

    typedef enum D3D10_DDI_FILTER
    {
        // Bits used in defining enumeration of valid filters:
        // bits [1:0] - mip: 0 == point, 1 == linear, 2,3 unused
        // bits [3:2] - mag: 0 == point, 1 == linear, 2,3 unused
        // bits [5:4] - min: 0 == point, 1 == linear, 2,3 unused
        // bit  [6]   - aniso
        // bits [8:7] - reduction type:
        //                0 == standard filtering
        //                1 == comparison
        //                2 == min
        //                3 == max
        // bit  [31]  - mono 1-bit (narrow-purpose filter)
    
        D3D10_DDI_FILTER_MIN_MAG_MIP_POINT                              = 0x00000000,
        D3D10_DDI_FILTER_MIN_MAG_POINT_MIP_LINEAR                       = 0x00000001,
        D3D10_DDI_FILTER_MIN_POINT_MAG_LINEAR_MIP_POINT                 = 0x00000004,
        D3D10_DDI_FILTER_MIN_POINT_MAG_MIP_LINEAR                       = 0x00000005,
        D3D10_DDI_FILTER_MIN_LINEAR_MAG_MIP_POINT                       = 0x00000010,
        D3D10_DDI_FILTER_MIN_LINEAR_MAG_POINT_MIP_LINEAR                = 0x00000011,
        D3D10_DDI_FILTER_MIN_MAG_LINEAR_MIP_POINT                       = 0x00000014,
        D3D10_DDI_FILTER_MIN_MAG_MIP_LINEAR                             = 0x00000015,
        D3D10_DDI_FILTER_ANISOTROPIC                                    = 0x00000055,
        D3D10_DDI_FILTER_COMPARISON_MIN_MAG_MIP_POINT                   = 0x00000080,
        D3D10_DDI_FILTER_COMPARISON_MIN_MAG_POINT_MIP_LINEAR            = 0x00000081,
        D3D10_DDI_FILTER_COMPARISON_MIN_POINT_MAG_LINEAR_MIP_POINT      = 0x00000084,
        D3D10_DDI_FILTER_COMPARISON_MIN_POINT_MAG_MIP_LINEAR            = 0x00000085,
        D3D10_DDI_FILTER_COMPARISON_MIN_LINEAR_MAG_MIP_POINT            = 0x00000090,
        D3D10_DDI_FILTER_COMPARISON_MIN_LINEAR_MAG_POINT_MIP_LINEAR     = 0x00000091,
        D3D10_DDI_FILTER_COMPARISON_MIN_MAG_LINEAR_MIP_POINT            = 0x00000094,
        D3D10_DDI_FILTER_COMPARISON_MIN_MAG_MIP_LINEAR                  = 0x00000095,
        D3D10_DDI_FILTER_COMPARISON_ANISOTROPIC                         = 0x000000d5,
    
        WDDM1_3DDI_FILTER_MINIMUM_MIN_MAG_MIP_POINT                     = 0x00000100,
        WDDM1_3DDI_FILTER_MINIMUM_MIN_MAG_POINT_MIP_LINEAR              = 0x00000101,
        WDDM1_3DDI_FILTER_MINIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT        = 0x00000104,
        WDDM1_3DDI_FILTER_MINIMUM_MIN_POINT_MAG_MIP_LINEAR              = 0x00000105,
        WDDM1_3DDI_FILTER_MINIMUM_MIN_LINEAR_MAG_MIP_POINT              = 0x00000110,
        WDDM1_3DDI_FILTER_MINIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR       = 0x00000111,
        WDDM1_3DDI_FILTER_MINIMUM_MIN_MAG_LINEAR_MIP_POINT              = 0x00000114,
        WDDM1_3DDI_FILTER_MINIMUM_MIN_MAG_MIP_LINEAR                    = 0x00000115,
        WDDM1_3DDI_FILTER_MINIMUM_ANISOTROPIC                           = 0x00000155,
        WDDM1_3DDI_FILTER_MAXIMUM_MIN_MAG_MIP_POINT                     = 0x00000180,
        WDDM1_3DDI_FILTER_MAXIMUM_MIN_MAG_POINT_MIP_LINEAR              = 0x00000181,
        WDDM1_3DDI_FILTER_MAXIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT        = 0x00000184,
        WDDM1_3DDI_FILTER_MAXIMUM_MIN_POINT_MAG_MIP_LINEAR              = 0x00000185,
        WDDM1_3DDI_FILTER_MAXIMUM_MIN_LINEAR_MAG_MIP_POINT              = 0x00000190,
        WDDM1_3DDI_FILTER_MAXIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR       = 0x00000191,
        WDDM1_3DDI_FILTER_MAXIMUM_MIN_MAG_LINEAR_MIP_POINT              = 0x00000194,
        WDDM1_3DDI_FILTER_MAXIMUM_MIN_MAG_MIP_LINEAR                    = 0x00000195,
        WDDM1_3DDI_FILTER_MAXIMUM_ANISOTROPIC                           = 0x000001d5,
    
        D3D10_DDI_FILTER_TEXT_1BIT                                      = 0x80000000 // Only filter for R1_UNORM format
    } D3D10_DDI_FILTER;
    

    5.9.5.3 Structs used by Tiled Resource DDIs

    
    typedef struct D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE
    {
        // Coordinate values below index tiles (not pixels or bytes).
        UINT X; // Used for buffer, 1D, 2D, 3D
        UINT Y; // Used for 2D, 3D
        UINT Z; // Used for 3D
        UINT Subresource; // indexes into mips, arrays. Used for 1D, 2D, 3D
        // For mipmaps that are packed into a single tile, any subresource
        // value that refers to any of the packed mips refers to the same tile.
    };
    
    typedef struct D3DWDDM1_3DDI_TILE_REGION_SIZE
    {
        UINT NumTiles;
        BOOL bUseBox; // TRUE: Uses width/height/depth parameters below to define the region.
                      //   width*height*depth must match NumTiles above.  (While
                      //   this looks like redundant information, the application likely has to know
                      //   how many tiles are involved anyway.)
                      //   The downside to using the box parameters is that one update region cannot
                      //   span mipmaps (though it can span array slices via the depth parameter).
                      //
                      // FALSE: Ignores width/height/depth parameters - NumTiles just traverses tiles in
                      //   the resource linearly across x, then y, then z (as applicable) then spilling over
                      //   mips/arrays in subresource order.  Useful for just mapping an entire resource
                      //   at once.
                      //
                      // In either case, the starting location for the region within the resource
                      // is specified as a separate parameter outside this struct.
    
        UINT Width;   // Used for buffer, 1D, 2D, 3D
        UINT16 Height; // Used for 2D, 3D
        UINT16 Depth; // For 3D or arrays.  For arrays, advancing in depth skips to next slice of same mip size.
    };
    
    typedef enum D3DWDDM1_3DDI_TILE_MAPPING_FLAG
    {
        D3DWDDM1_3DDI_TILE_MAPPING_NO_OVERWRITE = 0x00000001,
    };
    
    typedef enum D3DWDDM1_3DDI_TILE_RANGE_FLAG
    {
        D3DWDDM1_3DDI_TILE_RANGE_NULL = 0x00000001,
        D3DWDDM1_3DDI_TILE_RANGE_SKIP = 0x00000002,
        D3DWDDM1_3DDI_TILE_RANGE_REUSE_SINGLE_TILE = 0x00000004,
    };
    
    typedef enum D3DWDDM1_3DDI_TILE_COPY_FLAG
    {
        D3DWDDM1_3DDI_TILE_COPY_NO_OVERWRITE = 0x00000001,
        D3DWDDM1_3DDI_TILE_COPY_LINEAR_BUFFER_TO_SWIZZLED_TILED_RESOURCE = 0x00000002,
        D3DWDDM1_3DDI_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER = 0x00000004,
    };
    
    typedef enum D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAG
    {
        D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_TILED_RESOURCE = 0x00000001,
    };
    

    5.9.5.4 DDI Functions

    // --------------------------------------------------------------------------------------------------------------------------------
    // UpdateTileMappings
    // --------------------------------------------------------------------------------------------------------------------------------
    // See API - the runtime simply passes parameters through after validating most of them, except whether tile regions actually
    // fit on the specified resource.  The driver should ignore individual regions that are invalidly specified and then drop the
    // remainder of the call (no need to back out progress so far).  The debug runtime validates the parameters fully.
    //
    // Errors are reported via the call back pfnSetErrorCb.  Valid errors are out of memory and device removed.  On out of memory
    // (possible if memory allocation for page table storage fails), tile mappings are left in their original state before the call.
    //
    // If a driver implements commandlists and out of memory occurs when executing UpdateTileMappings in a commandlist,
    // the driver must invoke device removed.  Applications can avoid this situation by only doing update calls that change existing
    // mappings from Tiled Resources within command lists (so drivers will not have to allocate page table memory, only change the mapping).
    //
    // Note that many of the array parameters are optional and take special meaning if NULL as follows:
    // If pTiledResourceRegionStartCoordinates is NULL at the API (only allowed if NumTiledResourceRegions is 1), the runtime fills in a default
    // coordinate of {0,0,0,0} that is passed to the DDI (so the DDI will never see NULL).
    // If pTiledResourceRegionSizes is NULL at the DDI, all regions are assumed to be a single tile.  At the API, if NumTiledResourceRegions is 1,
    // pTiledResourceRegionStartCoordinates is NULL and pTiledResourceRegionSizes is NULL, the runtime calls the DDI with pTiledResourceRegionSizes
    // filled in to cover the entire resource (so the DDI won't see NULL for pTiledResourceRegionSizes in this case).
    //
    // If pRangeFlags is NULL, all tile ranges have 0 for Range Flags.
    // If pRangeTileCounts is NULL, all tile ranges have size 1 tile.
    // If pRangeFlags[i] specifies D3DWDDM1_3DDI_TILE_RANGE_NULL or _SKIP, the corresponding entry in pTilePoolStartOffsets[i] is ignored,
    //    and if the call defines nothing but NULL/SKIPs, pTilePoolStartOffsets can be NULL.
    //
    // At the API, if NumRanges is 1 and pRangeTileCounts is NULL, the runtime automatically fills in pRangeTileCounts[0] with the
    // total number of tiles specified by all the Tile Regions.
    //
    // See the API description for examples of common calling patterns - it might make sense for drivers to special-case some of
    // these if it turns out they could be executed more efficiently than through the path that handles the most general case.
    //
    typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_UPDATETILEMAPPINGS )(
        D3D10DDI_HDEVICE hDevice,
        D3D10DDI_HRESOURCE hTiledResource,
        UINT NumTiledResourceRegions,
        _In_reads_(NumTiledResourceRegions) const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pTiledResourceRegionStartCoordinates,
        _In_reads_opt_(NumTiledResourceRegions) const D3DWDDM1_3DDI_TILE_REGION_SIZE* pTiledResourceRegionSizes,
        D3D10DDI_HRESOURCE hTilePool,
        UINT NumRanges,
        _In_reads_opt_(NumRanges) const UINT* pRangeFlags, // D3DWDDM1_3DDI_TILE_RANGE_FLAG
        _In_reads_opt_(NumRanges) const UINT* pTilePoolStartOffsets,
        _In_reads_opt_(NumRanges) const UINT* pRangeTileCounts,
        UINT Flags // D3DWDDM1_3DDI_TILE_MAPPING_FLAG
    );
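
    As an illustration of the NULL-parameter defaults described above, here is a hedged
    sketch of the corresponding API-side call (assuming pContext2, pTiledTexture and
    pTilePool already exist), mapping the single tile at coordinate {0,0,0,0} of a tiled
    resource to the first tile of a tile pool and letting the runtime fill in the defaults
    before the call reaches the DDI:

        // Map one tile of pTiledTexture to tile 0 of pTilePool.
        D3D11_TILED_RESOURCE_COORDINATE coord = {}; // {X,Y,Z,Subresource} = {0,0,0,0}
        UINT tilePoolStartOffset = 0;
        HRESULT hr = pContext2->UpdateTileMappings(
            pTiledTexture,
            1, &coord,            // one region starting at tile {0,0,0,0}
            NULL,                 // sizes NULL: each region defaults to a single tile
            pTilePool,
            1,                    // one tile range
            NULL,                 // range flags default to 0 (plain mapping)
            &tilePoolStartOffset, // map to the start of the tile pool
            NULL,                 // tile counts default to 1 tile per range
            0 );                  // D3D11_TILE_MAPPING_FLAG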
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // CopyTileMappings
    // --------------------------------------------------------------------------------------------------------------------------------
    // See API - runtime simply passes through parameters with minimal validation (it does drop the call if the regions don't fit).
    //
    // Errors are reported via the call back pfnSetErrorCb.  Valid errors are out of memory and device removed.  On out of memory
    // (possible if memory allocation for page table storage fails), tile mappings are left in their original state before the call.
    //
    // If a driver implements commandlists and out of memory occurs when executing CopyTileMappings in a commandlist,
    // the driver must invoke device removed.  Applications can avoid this situation by ensuring that copy calls made within
    // command lists only change existing mappings of Tiled Resources (so drivers will not have to allocate page table memory, only change the mapping).
    //
    typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_COPYTILEMAPPINGS )(
        D3D10DDI_HDEVICE hDevice,
        D3D10DDI_HRESOURCE hDestTiledResource,
        _In_ const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pDestRegionStartCoordinate,
        D3D10DDI_HRESOURCE hSourceTiledResource,
        _In_ const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pSourceRegionStartCoordinate,
        _In_ const D3DWDDM1_3DDI_TILE_REGION_SIZE* pTileRegionSize,
        UINT Flags // D3DWDDM1_3DDI_TILE_MAPPING_FLAG
    );
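
    A hedged sketch of the corresponding API-side ID3D11DeviceContext2::CopyTileMappings
    call (pContext2 and the two tiled textures are assumed to already exist), copying the
    mappings of a linear run of two tiles between two tiled resources:

        D3D11_TILED_RESOURCE_COORDINATE destCoord = {};   // tile {0,0,0}, subresource 0
        D3D11_TILED_RESOURCE_COORDINATE srcCoord  = {};
        srcCoord.X = 4;                                   // copy from tiles {4,0} and {5,0}
        D3D11_TILE_REGION_SIZE regionSize = {};
        regionSize.NumTiles = 2;                          // bUseBox FALSE: linear run of 2 tiles
        HRESULT hr = pContext2->CopyTileMappings(
            pDestTiledTexture, &destCoord,
            pSrcTiledTexture, &srcCoord,
            &regionSize,
            0 );                                          // D3D11_TILE_MAPPING_FLAG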
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // CopyTiles
    // --------------------------------------------------------------------------------------------------------------------------------
    // See API - runtime simply passes through parameters with minimal validation.
    //
    // This DDI is not expected to fail (runtime will not check).
    
    typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_COPYTILES )(
        D3D10DDI_HDEVICE hDevice,
        D3D10DDI_HRESOURCE hTiledResource,
        _In_ const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pTileRegionStartCoordinate,
        _In_ const D3DWDDM1_3DDI_TILE_REGION_SIZE* pTileRegionSize,
        D3D10DDI_HRESOURCE hBuffer, // Default, dynamic or staging buffer
        UINT64 BufferStartOffsetInBytes,
        UINT Flags // D3DWDDM1_3DDI_TILE_COPY_FLAGS
    );
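
    A hedged sketch of the API-side ID3D11DeviceContext2::CopyTiles call (pContext2,
    pTiledTexture and pReadbackBuffer assumed to already exist), reading one tile's worth
    of data out of a tiled resource into a buffer:

        D3D11_TILED_RESOURCE_COORDINATE coord = {};  // tile {0,0,0}, subresource 0
        D3D11_TILE_REGION_SIZE regionSize = {};
        regionSize.NumTiles = 1;
        pContext2->CopyTiles(
            pTiledTexture, &coord, &regionSize,
            pReadbackBuffer,   // default, dynamic or staging buffer
            0,                 // byte offset into the buffer
            D3D11_TILE_COPY_SWIZZLED_TILED_RESOURCE_TO_LINEAR_BUFFER );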
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // UpdateTiles
    // --------------------------------------------------------------------------------------------------------------------------------
    // See API - runtime simply passes through parameters with minimal validation.
    //
    // This DDI is not expected to fail (runtime will not check).
    
    typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_UPDATETILES )(
        D3D10DDI_HDEVICE hDevice,
        D3D10DDI_HRESOURCE hDestTiledResource,
        _In_ const D3DWDDM1_3DDI_TILED_RESOURCE_COORDINATE* pDestTileRegionStartCoordinate,
        _In_ const D3DWDDM1_3DDI_TILE_REGION_SIZE* pDestTileRegionSize,
        _In_ const VOID* pSourceTileData, // caller memory
        UINT Flags // D3DWDDM1_3DDI_TILE_COPY_FLAGS
    );
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // TiledResourceBarrier
    // --------------------------------------------------------------------------------------------------------------------------------
    // See API - runtime simply passes through parameters with minimal validation.
    //
    // This DDI is not expected to fail (runtime will not check).
    typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_TILEDRESOURCEBARRIER )(
        D3D10DDI_HDEVICE hDevice,
        D3D11DDI_HANDLETYPE TiledResourceAccessBeforeBarrierHandleType,
        _In_opt_ const VOID* hTiledResourceAccessBeforeBarrier,
        D3D11DDI_HANDLETYPE TiledResourceAccessAfterBarrierHandleType,
        _In_opt_ const VOID* hTiledResourceAccessAfterBarrier
    );
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // GetMipPacking
    // --------------------------------------------------------------------------------------------------------------------------------
    // For a given tiled resource, returns how many mips are packed
    // and how many tiles are needed to store all the packed mips.
    // Packed mips include cases where multiple small mips share tile(s) and
    // also mips for which a given device cannot use standard tile shapes.  It is possible
    // for an entire resource to be considered packed.
    //
    // Applications are not told the tile shapes/layout for packed mips and must simply map
    // all or none of the packed tiles if any of the packed mipmaps are to be accessed.
    // Otherwise the observed mapping of individual pixels accessed will be undefined - IHV specific.
    //
    // For array surfaces, the returned values are the counts for a single array slice,
    // given the tile breakdown is identical for the mipmaps of each array slice.
    //
    // Mipmaps whose pixel dimensions fully fill at least one standard shaped tile in all
    // dimensions are not allowed to be considered part of the set of packed mips, otherwise
    // the runtime will remove the device on an invalid driver.
    // One example of dimensions that a device can validly lump into
    // the packed tiles (meaning the IHV can use its own custom tile breakdown) is
    // a mip that is at least one tile wide but less than a tile high.  Ideally though,
    // a device would stick with the standard tile breakdown for this case (so the application can
    // manage the tiles in a standard way).  If a device does need to use a custom tiling,
    // the application is not told what the tile breakdown is (only how many tiles are involved
    // in the packing overall), and thus loses some freedom.
    //
    typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_GETMIPPACKING )(
        D3D10DDI_HDEVICE hDevice,
        D3D10DDI_HRESOURCE hTiledResource,
        _Out_ UINT* pNumPackedMips, // How many mips are packed, for a given array slice,
                                    // including any mips that don't use the standard tile
                                    // shapes.  If there is no packing, return 0.
        _Out_ UINT* pNumTilesForPackedMips // How many tiles the packed mips fit into,
                                           // for a given array slice. Ignored if
                                           // *pNumPackedMips returned 0.
    );
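
    At the API this information surfaces through ID3D11Device2::GetResourceTiling. A
    hedged sketch (pDevice2 and pTiledTexture assumed to already exist) that retrieves
    the packed mip counts for a tiled texture:

        UINT numTilesTotal = 0;
        D3D11_PACKED_MIP_DESC packedDesc = {};
        pDevice2->GetResourceTiling(
            pTiledTexture,
            &numTilesTotal,
            &packedDesc,      // NumPackedMips / NumTilesForPackedMips live here
            NULL,             // standard tile shape not needed for this query
            NULL, 0, NULL );  // no per-subresource tilings requested
        // packedDesc.NumPackedMips and packedDesc.NumTilesForPackedMips correspond to
        // *pNumPackedMips and *pNumTilesForPackedMips at the DDI.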
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // CheckMultisampleQualityLevels
    // --------------------------------------------------------------------------------------------------------------------------------
    // Variant of the existing DDI for checking multisample quality level support with a new flags field that allows
    // tiled resource to be specified.
    //
    
    typedef enum D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAG
    {
        D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_TILED_RESOURCE = 0x00000001,
    } D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAG;
    
    typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_CHECKMULTISAMPLEQUALITYLEVELS )(
        D3D10DDI_HDEVICE hDevice,
        DXGI_FORMAT Format,
        UINT SampleCount,
        UINT Flags, // D3DWDDM1_3DDI_CHECK_MULTISAMPLE_QUALITY_LEVELS_FLAG
        _Out_ UINT* pNumQualityLevels
    );
    
    // --------------------------------------------------------------------------------------------------------------------------------
    // ResizeTilePool
    // --------------------------------------------------------------------------------------------------------------------------------
    // See API - runtime simply passes through parameters with minimal validation (it does fail the API call if the new size
    // is neither zero nor a multiple of the tile size).
    //
    // Errors are reported via the call back pfnSetErrorCb.  Valid errors are out of memory and device removed.  On out of memory,
    // tile mappings are left in their original state before the call.
    //
    
    typedef VOID ( APIENTRY* PFND3DWDDM1_3DDI_RESIZETILEPOOL )(
        D3D10DDI_HDEVICE hDevice,
        D3D10DDI_HRESOURCE hTilePool,
        UINT64 NewSizeInBytes
    );
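
    A hedged API-side sketch (pContext2 and pTilePool assumed to already exist) that
    grows a tile pool; the new size must be zero or a multiple of the 64KB tile size:

        const UINT64 TILE_SIZE_IN_BYTES = 65536;  // standard 64KB tile
        HRESULT hr = pContext2->ResizeTilePool( pTilePool, 256 * TILE_SIZE_IN_BYTES );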
    
    

    5.9.6 Quilted Textures - For future consideration only

    This section is not part of the requirements for the initial implementation of Tiled Resources - it is for future consideration only.

    Texture filtering shader instructions can view Texture2DArray Resources as if all the array slices are arranged in a "quilt"/grid that appears as one surface rather than an array of them.

    The term "quilt" is meant to evoke the analogy of a collection of rectangular pieces of fabric that have been stitched together in a grid, but instead of fabric, the pieces are slices of a Texture2DArray.

    This enables applications to achieve texture filtering on surfaces that appear far larger than the size limits for individual Texture2D surfaces imposed by D3D.

    Ideally, double precision texture coordinate interpolation would be supported, so that precision could be maintained when interpolating and representing normalized coordinate values over surfaces that are too large for float32 precision (D3D's texture size limits are basically already there). However, requiring double precision, and furthermore requiring hardware to support individual surfaces that scale indefinitely in size, is out of scope in the timeframe for this feature.

    Any Texture2DArray Resource that is not Multisampled can have a Quilted Shader Resource View created on it. Starting with a Texture2DArray Resource, the following parameters describe how to define a Quilt:

    // Descriptor for building a Quilt SRV from a Texture2DArray
    typedef struct D3D11_TEX2D_QUILT_SRV
    {
        UINT MostDetailedMip;
        UINT MipLevels;
        UINT FirstArraySlice; // First slice to use in the quilt (does this have to be 0?)
        UINT QuiltWidthInArraySlices;
        UINT QuiltHeightInArraySlices;
    } D3D11_TEX2D_QUILT_SRV;
    
    // Array slices are assigned into the Quilt starting from FirstArraySlice
    // at the top-left of the Quilt, progressing in row order.
    // e.g. if FirstArraySlice is 0, the width is 2 and the height is 2,
    // the array slices map to the quilt like this:
    //  0 1
    //  2 3
    
    

    An IHV requested constraints on the Quilt Width/Height. One possible constraint is that the maximum QuiltWidthInArraySlices is 32, and the same for Height. These dimensions may also have to be pow2, though the Quilt should at least be allowed to be non-square in ArraySlices.

    One observation is that even if Quilt dimensions are constrained to pow2, applications that wish to represent nonPow2 overall surface dimensions (at the texel level) can still pick nonPow2 dimensions for the individual Array slices (identical for all slices).

    Either Tiled or non-Tiled Resources can be used for a Quilt SRV, though Tiled Resources will likely be far more practical for managing massive surfaces.


    5.9.6.1 Sampling Behavior for Quilted Textures

    Shaders have to declare the dimension (e.g. Texture2D) of any SRV they access. This applies to Quilted Texture2D SRVs as well (the Quilt property will be part of the dimensionality naming).

    Any Shader instruction that involves the texture filtering hardware (e.g. instructions that take a Sampler as a parameter) sees the Quilting on a Quilted Texture2D, but addresses the surface using the same coordinates as if it is a Texture2DArray. That means that the texture coordinates include an integer array slice in addition to the U/V normalized coordinates. The U/V normalized coordinates are relative to the selected array slice. So coordinates in the range [0..1] span the selected array slice, just like a normal Texture2DArray. However U/V coordinates outside [0..1] refer to the appropriate neighboring array slice in the Quilt layout. e.g. a U coordinate of 1.5 indicates the middle of the array slice to the right in the quilt. The texture filtering hardware knows how to navigate the quilt in this fashion for each individual texel that is fetched.
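
    Since Quilts are for future consideration only, no API exists for this behavior; the
    following is purely a conceptual sketch (all names hypothetical) of the coordinate
    remapping the filtering hardware would perform per fetched texel, including the
    whole-quilt wrap described later in this section:

        #include <math.h>

        // Conceptual sketch only: remap a Texture2DArray-style quilt coordinate
        // (u, v, slice) to the array slice and local [0..1) UV that would
        // conceptually be fetched.  Coordinates that fall off the edge of the
        // entire quilt wrap to the other side of the quilt.
        struct QuiltCoord { float u, v; unsigned slice; };

        QuiltCoord RemapQuiltCoord( float u, float v, unsigned slice,
                                    unsigned firstSlice,
                                    unsigned quiltWidth, unsigned quiltHeight )
        {
            // Grid position of the addressed slice within the quilt (row order).
            unsigned col = (slice - firstSlice) % quiltWidth;
            unsigned row = (slice - firstSlice) / quiltWidth;

            // Express the coordinate in slice units across the whole quilt,
            // then wrap against the overall quilt dimensions.
            float qu = fmodf( col + u, (float)quiltWidth );
            float qv = fmodf( row + v, (float)quiltHeight );
            if( qu < 0 ) qu += quiltWidth;   // fmodf keeps the sign of its input
            if( qv < 0 ) qv += quiltHeight;

            QuiltCoord out;
            out.u = qu - floorf( qu );       // local UV within the selected slice
            out.v = qv - floorf( qv );
            out.slice = firstSlice + (unsigned)qv * quiltWidth + (unsigned)qu;
            return out;
        }

    For example, with a 2x2 quilt, (u,v,slice) = (1.5, 0.5, 0) remaps to slice 1 at local
    UV (0.5, 0.5), i.e. the middle of the array slice to the right in the quilt.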

    This Quilt traversal ability is similar to the way the texture filtering hardware also understands how to navigate across a TextureCube from face to face.

    Hardware derivative calculations do not understand anything about Quilting; they are not able to remap coordinates from different array slices into the same number space.

    For hardware derivative calculations (e.g. used in mipmap LOD calculation) to work correctly on Quilted texture coordinates, applications can simply use the same array-slice for all the coordinates in a given primitive (e.g. triangle). If a triangle spans multiple array slices, the coordinates would have to be mapped to the normalized space of any one of the array slices, making use of texture coordinates outside [0..1].

    The ability of the filtering hardware to traverse over the Quilt applies to the mipmaps as well.

    The number of mipmaps available to a given Array Slice is limited by the dimensions of the individual Array slice. This means that a Quilt Texture2D never has all mipmaps available to it (like a pyramid with the top chopped off). The effective size of the coarsest mipmap in a Quilt is the Quilt dimensions in texels (the 1x1 mip from each Array Slice quilted together).

    If an application really needs to model a full mipmap pyramid while using Quilts, it must resort to something like creating a second texture that "caps" the pyramid. The "cap" might overlap one mip level with the Quilt (so linear filtering across mips remains well posed). Then at the time of sampling, the application can choose to sample from either the Quilt texture or the "cap" texture based on the LOD.

    When an application is generating mipmap data for a Quilt, it would be incorrect to generate the mipmap chain for each Array Slice's mip chain independently. Instead, the mipmap contents should be calculated as if the Quilt is one huge surface. That is what the texture filtering hardware is assuming.

    When falling off an edge of the entire Quilt, the coordinate wraps to the other side of the entire Quilt. The Sampler addressing configuration (wrap/mirror/border etc.) is ignored for Quilts.

    This constraint to wrap-only was requested by an IHV. Ideally, all addressing modes available to non-Quilt surfaces (wrap, border, clamp etc.) would operate as expected when sampling off the end of a Quilt.

    The resinfo instruction (which reports texture dimensions to the shader) reports the dimensions of a Quilted Texture2D not in terms of the underlying Texture2DArray but rather as if it is a large non-array texture whose width/height span the quilt. The number of mipmaps is of course the same for every array slice as for the entire quilt.


    5.9.7 Tiled Resources Features Tiers

    Windows Blue exposes Tiled Resources support in two tiers using caps. In future releases, a new tier may be added including the recommendations listed below.


    5.9.7.1 Tier 1


    5.9.7.1.1 Limitations affecting Tier 1 only

    5.9.7.2 Tier 2


    5.9.7.3 Some Future Tier Possibilities


    5.9.7.4 Capability Exposure


    5.9.7.4.1 Tiled Resources Caps

    The CheckFeatureSupport DDI has a query for Tiled Resources support:

    This query reports support via a flags bitfield, to allow for some amount of future expansion of the caps reporting at the DDI if needed. The Tier flags are cumulative (if the runtime sees Tier 2 support it assumes Tier 1 support regardless of the flag).

    typedef enum D3DWDDM1_3DDI_TILED_RESOURCES_SUPPORT_FLAG
    {
        D3DWDDM1_3DDI_TILED_RESOURCES_TIER_1_SUPPORTED = 0x00000001,
        D3DWDDM1_3DDI_TILED_RESOURCES_TIER_2_SUPPORTED = 0x00000002,
    } D3DWDDM1_3DDI_TILED_RESOURCES_SUPPORT_FLAG;
    
    // D3DWDDM1_3DDICAPS_D3D11_OPTIONS1
    typedef struct D3DWDDM1_3DDI_D3D11_OPTIONS_DATA1
    {
        UINT TiledResourcesSupportFlags;
    } D3DWDDM1_3DDI_D3D11_OPTIONS_DATA1;
    
    

    At the API, the Tiers are exposed via CheckFeatureSupport using an enum for the Tiers. Support for Min/Max Filtering is called out as a separate cap since the feature is distinct from Tiled Resources, however the runtime simply sets this capability true for hardware that supports Tier 2 and false for any lower level.

    typedef enum D3D11_TILED_RESOURCES_TIER
    {
        D3D11_TILED_RESOURCES_NOT_SUPPORTED = 0,
        D3D11_TILED_RESOURCES_TIER_1 = 1,
        D3D11_TILED_RESOURCES_TIER_2 = 2,
    } D3D11_TILED_RESOURCES_TIER;
    
    typedef struct D3D11_FEATURE_DATA_D3D11_OPTIONS1
    {
        D3D11_TILED_RESOURCES_TIER TiledResourcesTier;
        BOOL MinMaxFiltering;
    } D3D11_FEATURE_DATA_D3D11_OPTIONS1;
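
    A hedged sketch of querying the tier at the API (pDevice assumed to already exist):

        D3D11_FEATURE_DATA_D3D11_OPTIONS1 opts = {};
        if( SUCCEEDED( pDevice->CheckFeatureSupport(
                D3D11_FEATURE_D3D11_OPTIONS1, &opts, sizeof(opts) ) ) &&
            opts.TiledResourcesTier >= D3D11_TILED_RESOURCES_TIER_2 )
        {
            // Tier 2 (and therefore Tier 1) behavior, including Min/Max
            // Filtering, can be relied upon here.
        }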
    

    5.9.7.4.2 Multisampling Caps

    The CheckMultisampleQualityLevels1 API and corresponding CheckMultisampleQualityLevels DDI now have a flags field that allows the driver to be queried for its level of support for Multisampling on Tiled Resources (which can be different from the level of support for non-tiled resources - the number of Quality Levels, for example).
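
    A hedged API-side sketch (pDevice2 assumed to already exist) of querying multisample
    support for tiled resources specifically:

        UINT numQualityLevels = 0;
        HRESULT hr = pDevice2->CheckMultisampleQualityLevels1(
            DXGI_FORMAT_R8G8B8A8_UNORM,
            4,                                                     // 4x MSAA
            D3D11_CHECK_MULTISAMPLE_QUALITY_LEVELS_TILED_RESOURCE,
            &numQualityLevels );
        // numQualityLevels == 0 means this format/sample-count combination is
        // unsupported for tiled resources, even if supported for non-tiled ones.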


    6 Multicore


    Chapter Contents

    (back to top)

    6.1 Features
    6.2 Thread Re-entrant Create routines
    6.3 Command Lists
    6.4 DDI Features and Changes


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    The objectives of the features described in this section are to enable efficient distribution of rendering workload/ overhead in the application, runtime, and driver across multiple CPU cores in D3D11. These architectural changes are designed to allow multithreaded rendering applications to be written without overbearing restrictions, and gain close to the expected efficiency advantages when doing so.

    The primary features discussed are:

    1. Asynchronous creation of object types in separate threads.
    2. Command Lists (a.k.a. Display Lists), which can be created asynchronously in separate threads.

    A separate D3D11 API/DDI spec contains more concrete implementation details about the topics discussed here.

    6.1 Features

    Applications would like to create all object types (most particularly resources and shaders) on different threads simultaneously and in parallel with other rendering threads, especially to enable background or bulk loading/ compiling. D3D11 will continue to rely on shared resources to achieve fully parallel GPU usage or multi-GPU usage, which effectively means only limited resource sharing is available for such scenarios. Lastly, the ability to generate Command Lists also fits in well when trying to leverage multi-core CPUs, as long as each Command List can be built on separate CPU threads. However, Command Lists are still required to be executed by the one thread that is, generally, dedicated as the render thread.

    It is important to note that although Command Lists are reusable across frames, the design point for this feature is use-once. Command List creation overhead in the runtime and driver should be low enough that single-use for the sole purpose of distributing work across threads provides a significant performance win. Likewise, the overhead of submitting the Command List in the main rendering thread (immediate context) should be minimized – the design should diminish any need to patch or recompile Command Lists. If multi-use optimizations become interesting, implementations are encouraged to promote such optimizations once a use-threshold has been reached. While the use of a single-use hint flag has been considered, detecting multi-use seems best, to avoid application abuse/ mis-use of hints.

    Overview (the names here were chosen to align with kernel concepts to promote quicker understanding, and do not represent the final API or DDI):

    The main aspects to notice are: the separation of IDevice from IContext (as IContext is expected to be implemented by two types of Contexts), the concept of a single Immediate Context per Device, the possibility of multiple Deferred Contexts, the Command List object types, and all the new methods that deal with these new objects. Map, Unmap, and GetData are not expected to work on a Deferred Context, while Finalize will not work on the Immediate Context. Further details and options are provided later.

    6.2 Thread Re-entrant Create routines

    D3D11 allows creation routines to be thread re-entrant, as highlighted in the diagram by grouping such methods on the IDevice interface. This is not accomplished with coarse-grained critical sections; fine-grained critical sections are required internally, when necessary. Ideally, no internal synchronization needs to occur, but that is probably not realistic. Not only can one thread be rendering (i.e. calling Draw) while another thread is calling CreateShader; but two threads can be calling CreateShader, while a third thread calls CreateResource, and a fourth is rendering, etc. Due to symmetry, destruction of objects will also be re-entrant. However, the typical destruction of an object goes through multiple stages to keep destruction performant. See Deferred Destruction(6.4.3) for details.

    6.2.1 Better Support for Initial Data

    In the D3D10 timeframe, the majority of drivers treated Initial Data passed to the Create functions as equivalent to using UpdateSubresource, which is technically a rendering command and naturally presents obstacles to separating creation and rendering. In addition, the UpdateSubresource path would typically force the resource to be faulted into video memory. With changes to the OS kernel, the driver can use the Map/ Unmap path for Initial Data; but this path is unavailable on both Vista and Windows 7. Unfortunately, drivers are required to significantly change their current implementation surrounding this feature in order to concurrently upload initial data without significantly perturbing the render thread/ frame rate. This is viewed as short-term pain until the desired kernel changes are available, with "short-term" being of unknown duration.


    6.3 Command Lists


    Section Contents

    (back to chapter)

    6.3.1 Overview
    6.3.2 Fire and Forget Model, No Feedback
    6.3.3 No Context State Inheritance
    6.3.4 No Context State Aftermath
    6.3.5 Object State Inheritance & Aftermath
    6.3.6 Query Interactions
    6.3.7 Nested Command Lists
    6.3.8 Allow Map Write on Resources with Restriction
    6.3.9 Application Immutable, but Patching is Still Required

    6.3.9.1 Discarded Dynamic Resources
    6.3.9.2 SwapChain Back Buffers
    6.3.9.3 Hazards Still Present During Execution


    6.3.1 Overview

    The concept of a Command List has been around in other graphics APIs, and partially supported by features in previous versions of Direct3D. Instead of immediately executing graphics commands (or giving the impression of such a model), the graphics commands are recorded for execution later. In the overview, the Deferred Context represents the facility to generate Command Lists. Command Lists work well when supporting multi-core CPUs. Command Lists can be generated by separate threads, although they must be manually executed via the render thread using the Immediate Context. The threading model is that a Context (either Immediate or Deferred) cannot be manipulated by more than one CPU thread simultaneously. Two Contexts, however, can be manipulated simultaneously, in parallel with each other, etc. After generation, a Command List can be used multiple times; but cannot be altered by the application explicitly. The interface for a Deferred Context is generally the same as the Immediate Context, with some exceptions. After work has been built up with a Deferred Context, the Command List must be generated by invoking Finalize. By default, Finalize will leave the Deferred Context in a zombie state, waiting for the Deferred Context to be destroyed. However, there will be an option to reset the Deferred Context and allow a new sequence of commands to be recorded, effectively re-creating the Deferred Context. If specialized IContext methods designed for the Immediate Context are invoked off a Deferred Context, they fail; and vice versa.
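
    As shipped, the D3D11 API names differ from the provisional names used above (Finalize
    became FinishCommandList, Execute became ExecuteCommandList). A minimal sketch of the
    lifecycle using the shipped names (pDevice and pImmediateContext assumed to already
    exist; error handling elided):

        ID3D11DeviceContext* pDeferredContext = NULL;
        pDevice->CreateDeferredContext( 0, &pDeferredContext );

        // Record commands on a worker thread; the Deferred Context starts in
        // the default (ClearState-equivalent) state.
        pDeferredContext->Draw( 3, 0 );

        ID3D11CommandList* pCommandList = NULL;
        pDeferredContext->FinishCommandList(
            FALSE,            // do not restore the Deferred Context state
            &pCommandList );

        // Back on the render thread:
        pImmediateContext->ExecuteCommandList( pCommandList, FALSE );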

    6.3.2 Fire and Forget Model, No Feedback

    Since a Deferred Context is building up a deferred timeline for the GPU, the CPU must restrict itself to only sending data to the GPU in a fire-and-forget manner. Deferred Contexts cannot get any feedback from the GPU. Therefore, Resources cannot be Mapped for read access, Query data cannot be retrieved, etc. Such operations can only be done by the rendering thread manipulating the Immediate Context, as the GPU is actually able to make forward progress and resolve the dependencies on data that the CPU requires.

    6.3.3 No Context State Inheritance

    State Inheritance refers to the ability of the Command List to inherit the current state of the Immediate Context when executed. No Immediate Context state (such as bound render targets or shaders) can be inherited by the Command List. The state of the Deferred Context always starts out in the default Context state (i.e. equivalent to giving the new Deferred Context ClearState as its first command, or equivalent to the Immediate Context state immediately upon creation).

    6.3.4 No Context State Aftermath

    When a Command List is actually scheduled/ executed on either the Immediate or Deferred Context, the state of the Context (such as bound render targets and shaders) will be altered afterward. The state of the Context will revert to the default Context state (i.e. equivalent to executing ClearState implicitly, immediately after Command List execution).
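
    With the shipped API (names as in the sketch earlier; pRTV, pDSV and pCommandList
    assumed to already exist), passing FALSE as the second parameter of
    ExecuteCommandList yields exactly this aftermath, so state must be re-applied:

        pImmediateContext->OMSetRenderTargets( 1, &pRTV, pDSV );
        pImmediateContext->ExecuteCommandList( pCommandList, FALSE );
        // The Context has now reverted to the default state; the render target
        // binding above is gone and must be re-bound before further drawing.
        pImmediateContext->OMSetRenderTargets( 1, &pRTV, pDSV );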

    6.3.5 Object State Inheritance & Aftermath

    While Command Lists and the Immediate Context state are effectively sheltered from each other, there is a form of Inheritance and Aftermath that needs to occur to make Command Lists useful: Resources and Query contents, etc. When a Command List executes on the Immediate Context, it inherits and can change the global state of objects, such as texture data, constant buffer data, and query data. Therefore it is possible to generate Command Lists that conditionally do different things, with creative use of Predicates and Resource data.

    6.3.6 Query Interactions

    Query data can be generated by Deferred Contexts, just as Render Target data is generated; and Queries can be wrapped around Command List execution. However, there are some problematic cases that need to be handled, assuming the Query syntax remains unchanged.

    First, for Queries that have a Beginning and an End, like Predicates, such bracketing must stay local to a particular Context (i.e. Begin & End must occur within the same command timeline). It is not possible for a Begin on one Context to be matched with an End on another Context or Command List. For example, problematic cases arise when a bracketing is begun in the Immediate Context and ended by a Command List, and vice versa. This is not allowed, and is enforced. If a Command List manipulates a Query (where the corresponding Deferred Context called Begin or End on the Query), the Command List execution will not be allowed on a Context where the same Query has only been Begun. In addition, any Queries that have been Begun in the Deferred Contexts but not Ended are implicitly Ended by the invocation of Finalize.

    Second, when the Command List was being generated, was it assumed that the Command List execution could’ve been wrapped by any of the available Queries? This can be particularly troubling if a Query has hardware bugs related to it and needs some form of emulation. For example, if Blts are being emulated by the 3d pipeline, such operations are specified not to affect certain Queries. To satisfy the specification, the driver could poll any actively monitored counters and subtract off the Blt contribution from Query results. Such driver workarounds are hard to adapt to the Blts that may occur in a Command List. This does have implications on Software Command List implementations (i.e. it may not be known until Command List execution whether a software fallback will be leveraged, meaning the Deferred Context may need to build multiple types of Command Lists).

    6.3.7 Nested Command Lists

    Command Lists can call Command Lists, i.e. Execute can be called on a Deferred Context. Once Command List usage becomes popular, preventing nested Command Lists would present an obstacle to quickly offloading code from the Immediate Context to a Deferred Context. Reducing the disparity between Deferred Context authoring and Immediate Context authoring, when possible, removes obstacles to Deferred Context usage. Infinite recursion is prevented naturally due to the separation of Command List and Deferred Context (i.e. in order to execute a Command List, the Deferred Context must be Finalized). This also means that nested Command Lists are finalized before they can be called by other Command Lists. There is no limit on the level of Command List indirection, but there is a practical limit on how deep nesting can realistically be tested.

    Executing a Command List from a Deferred Context has the same State Aftermath as executing it on the Immediate Context: an implicit ClearState occurs. The Query restrictions that exist between Immediate Context and Deferred Context also exist for nested Command Lists.

    6.3.8 Allow Map Write on Resources with Restriction

    The restriction that Deferred Contexts cannot Map any Resource presents an obstacle to quickly offloading code from the Immediate Context to a Deferred Context. Efficiently written software and middleware inevitably use dynamic resources for quick upload to the GPU. Such software would need separate code-paths in order to be Context-agnostic (i.e. run against an Immediate Context or a Deferred Context) if Map were completely disallowed. However, if the first invocation of Map for a resource on a Deferred Context is a discard, and all Maps are write-only, these resource operations can be captured without conceptual complications. The entire operation can be converted to be analogous to the UpdateSubresource scenario on the same Deferred Context. Reducing the disparity between Deferred Context authoring and Immediate Context authoring, when possible, removes obstacles to Deferred Context usage.
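
    A minimal sketch with the shipped API (pDeferredContext and a dynamic constant buffer
    pDynamicCB assumed to already exist): the first Map of a resource on a Deferred
    Context must use discard, and only write access is permitted:

        float constants[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        D3D11_MAPPED_SUBRESOURCE mapped;
        if( SUCCEEDED( pDeferredContext->Map(
                pDynamicCB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped ) ) )
        {
            memcpy( mapped.pData, constants, sizeof(constants) );  // write-only access
            pDeferredContext->Unmap( pDynamicCB, 0 );
        }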

    6.3.9 Application Immutable, but Patching is Still Required

    For all practical purposes, the application interprets Command Lists as immutable (i.e. constant after creation). However, there are some cases that could require modification of the Command List to some degree behind the scenes. These are forms of Resource renaming, though they are accomplished via different means.

    6.3.9.1 Discarded Dynamic Resources

    Even if Map were not allowed on the Deferred Context, there are still interactions between Command Lists and Map with discard that require special attention. Imagine this code sequence:

        pData = pImmediateContext->Map( pDynamicBuffer, DISCARD );
        *pData = 1;                                  // first version of the contents
        pImmediateContext->Unmap( pDynamicBuffer );

        pDeferredContext = pDevice->CreateDeferredContext();
        pDeferredContext->CopyResource( pStagingBuffer, pDynamicBuffer ); // records a reference to pDynamicBuffer
        pDisplayList = pDeferredContext->Finalize();

        pData = pImmediateContext->Map( pDynamicBuffer, DISCARD );  // discard renames the buffer
        *pData = 2;                                  // second version of the contents
        pImmediateContext->Unmap( pDynamicBuffer );

        pImmediateContext->Execute( pDisplayList );  // must observe the renamed buffer holding 2
        pData = pImmediateContext->Map( pStagingBuffer, 0 );
    

    The contents of the staging Buffer must be 2, not 1.

    6.3.9.2 SwapChain Back Buffers

    The following case is similar to Dynamic Buffers. Even though Present is not allowed on the Deferred Context, there are still interactions between Command Lists and Present that require special attention. Present rotates the identities of the back buffers, which naturally must affect any Command List that contains references to the Back Buffers.

    6.3.9.3 Hazards Still Present During Execution

    Resource read-after-write hazards and other similar issues still need attention. One Command List might be executed that reads from a Resource after another Command List was executed that wrote to the same Resource. It may be feasible to do full pipeline flushes between Command Lists that are used to achieve multi-CPU-thread parallelism; a dual-core system will probably execute only one such Command List per frame. But Command Lists that are re-used will have a tendency to be smaller and used many times per frame, and full pipeline flushes may not be acceptable for such Command Lists.


    6.4 DDI Features and Changes


    Section Contents

    (back to chapter)

    6.4.1 Overview
    6.4.2 Thread Re-entrant Callback Routines
    6.4.3 Deferred Destruction
    6.4.4 Context Local Storage Handles
    6.4.5 Software Command List Assistance


    6.4.1 Overview

    The need to make certain DDI entry points thread re-entrant implies an increased awareness of threading at the DDI and, naturally, a myriad of changes to keep things efficient and reduce the propensity for bugs. With the increased usage of critical sections come increased chances for deadlocks. For example, in D3D10, there was a well-defined ordering in which critical sections must be acquired and released, to prevent such deadlocks when holding critical sections simultaneously. If such semantics (e.g. whether one component can hold a critical section during an invocation into another component) do not fall out of the general design of the runtime and DDI, then there is an increased burden of documentation and testing. If the API and callbacks can be designed such that the user mode driver needs no internal synchronization, ensuring that no deadlocks occur should be much easier.

    6.4.2 Thread Re-entrant Callback Routines

    With multiple threads in the user mode driver at one time, the DDI callbacks must be thread-safe. The DDI callbacks are generally thin wrappers around the thunks provided by DXGI. They isolate the driver from kernel handles and kernel function signatures. The kernel function signatures may change from OS release to OS release. D3D11 DDI callbacks have identical function signatures and functionality as D3D10 DDI callbacks. However, in contrast to D3D10 DDI callbacks, D3D11 DDI callbacks are designed to be free-threaded when used with a driver that supports thread-safe creation. Callbacks used to satisfy creations will need to be thread re-entrant or provide thread re-entrant counterparts. Ideally D3D11 DDI callbacks would be completely free-threaded, but a few restrictions still remain. One restriction is that only a single thread can be working against an HCONTEXT at a time. Callbacks that use an HCONTEXT are pfnPresentCb, pfnRenderCb, pfnEscapeCb, pfnDestroyContextCb, pfnWaitForSynchronizationObjectCb, and pfnSignalSynchronizationObjectCb. Thus, if more than one thread is calling these callbacks using the same HCONTEXT, they are required to be synchronized. This is quite natural since these are callbacks that are likely to be called only from the thread that is manipulating the immediate context. Another restriction is that certain callbacks are required to be invoked during DDI function calls using the same thread that called the DDI.

    pfnDeallocateCb deserves special mention, as it is not required to be called before the driver returns from D3D10DDI_DEVICEFUNCS::pfnDestroyResource for the majority of resource types. Since pfnDestroyResource is a free-threaded function, the driver must defer destruction of the object until it can efficiently ensure that no existing immediate context reference remains (i.e. that pfnRenderCb is called before calling pfnDeallocateCb). This applies even to shared resources, or any other invocation of pfnDeallocateCb that complements HRESOURCE usage with pfnAllocateCb; but it does not apply to primaries.

    6.4.3 Deferred Destruction

    One of the basic tasks of the API is lifetime management of objects and handles. To stay efficient, the API prefers that object and handle destruction be deferred and amortized by default. Typically, deferment means until the GPU is no longer using the object; however, here the term is meant to represent that the CPU is no longer using an object. The API will not immediately delete an object whose ref count drops to 0. Instead, every flush of a command buffer gives the API an amortized opportunity to find those objects whose ref count is 0 and that are no longer bound to the Immediate Context. This list of handles to delete can be provided to the driver to assist with an efficient flush. There may be additional mechanisms to destroy handles to suit all the needs of the API; but the guarantee will still exist that destroyed handles will not be currently bound to any context.
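
    An illustrative sketch of this amortized pattern (hypothetical types only; this is
    not the actual runtime implementation):

        #include <vector>

        struct Handle { int refCount; bool boundToImmediateContext; };

        void FlushCommandBuffer( std::vector<Handle*>& allHandles,
                                 std::vector<Handle*>& handlesToDelete )
        {
            // Each flush sweeps for handles whose CPU-side ref count reached 0
            // and which are no longer bound to the Immediate Context; the only
            // guarantee is that destroyed handles are not bound to any context.
            for( Handle* h : allHandles )
                if( h->refCount == 0 && !h->boundToImmediateContext )
                    handlesToDelete.push_back( h );
            // ... submit the command buffer, passing handlesToDelete to the
            // driver so it can destroy them during an otherwise-required flush.
        }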

    6.4.4 Context Local Storage Handles

    The user mode driver has to manipulate data local to each object/ handle involved in order to interact with the driver models. For example, allocation lists have to be built up to accompany command buffer submissions. Because all objects are now becoming nearly process-global, modifying data directly associated with these objects would require synchronization. It is more efficient to have an area of memory strongly associated with each object, but also local to a context, allowing CPU-thread modification of that memory without synchronization. The user mode driver can provide the size required for such memory, gaining efficiency along with anything else the runtime needs to allocate.

    6.4.5 Software Command List Assistance

    The runtime provides a default implementation of the Deferred Context that will emulate Command List support. Even if all the API features can be supported directly in hardware, this does help bootstrap a driver faster. In addition, it can possibly be leveraged for debugging.


    7 Common Shader Internals


    Chapter Contents

    (back to top)

    7.1 Instruction Counts
    7.2 Common instruction set
    7.3 Temporary Storage
    7.4 Immediate Constants
    7.5 Constant Buffers
    7.6 Shader Output Type Interpretation
    7.7 Shader Input/Output
    7.8 Integer Instructions
    7.9 Floating Point Instructions
    7.10 Vector vs Scalar Instruction Set
    7.11 Uniform Indexing of Resources and Samplers
    7.12 Limitations on Flow Control and Subroutine Nesting
    7.13 Memory Addressing and Alignment Issues
    7.14 Shader Memory Consistency Model
    7.15 Shader-Internal Cycle Counter (Debug Only)
    7.16 Textures and Resource Loading
    7.17 Texture Load
    7.18 Texture Sampling
    7.19 Subroutines / Interfaces
    7.20 Low Precision Shader Support in D3D


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    Full details of the Shader models for each shader stage are provided in dedicated sections elsewhere in the spec. What follows is a discussion of a few general items (not an exhaustive list) that are common to all of the Shader models.

    7.1 Instruction Counts

    There are no limits on total shader program length or execution time (accounting for loops and subroutines), aside from any limitations in what may be expressed in the shader token format. Clearly longer programs will degrade in performance, but D3D11.3 currently does not specify how steeply performance will degrade relative to program length or execution time given that there are so many variables that might affect performance.

    7.2 Common instruction set

    Aside from a few exceptions, the instruction set for all the shader stages is identical. The exceptions are confined to instructions that only make sense in a given Shader unit. For example, the sample instruction computes LOD based on derivatives, so sample and sample_b (sample with LOD bias) are only relevant in the Pixel Shader where derivatives are present, while sample_l (sample at selected LOD) and sample_d (sample with application-provided derivatives) are available in all stages.

    7.3 Temporary Storage

    Temporary storage is composed of a single Element type, which is a 4-tuple of untyped 32-bit quantities. Temporary storage consists of two classes of storage: registers, which are non-indexed single elements; and arrays, which are indexable 1D arrays of elements. Temporary storage is read/write, and is uninitialized at the start of a Shader execution instance. Reads of temporary storage that has not been previously written within a Shader execution instance return undefined values, but cannot return data outside of the address space of the device context.

    Temporary registers are declared(22.3.35) r#, and can be used as a temporary operand in D3D11.3 instructions.

    Temporary arrays are declared(22.3.36) as x#[n], where “n” is the array length (indexed with 0..n-1). Temporary arrays must be indexed by an r# scalar, a statically indexed x# scalar, and/or an optional immediate constant (literal), and can have only one level of index nesting (e.g. x0[x1[r0.x+1].x+1] is not legal, but x0[x1[1].x+1] is legal). A temporary array reference, x#[?], can be used as a temporary operand in D3D11.3 instructions (i.e. anywhere an r# can be used). Out of bounds access to x#[?] is undefined, except that data outside the GPU process context is never visible.

    The total quantity of temporary storage per Shader execution instance is 4096 elements, which can be utilized in any combination of registers and arrays; i.e. the total number of elements across all declared r# registers and x# arrays must be <= 4096.

    Note that the namespaces for r# and x# (the #) are independent. e.g. suppose r2 and x2[5] are declared. They are independent, but together they count as 6 elements of storage against the limit of 4096.

    To provide a run-time stack, a program allocates a temporary array of a fixed size. The program should provide its own stack bounds checking, e.g., skip calls if the stack push would exceed the array bounds.

    There is no limit on the total number of times temp registers (the same one or different ones) can appear in a single instruction or in a shader.

    7.4 Immediate Constants

    For any instruction source argument that is capable of taking a temporary register, it is also permitted to supply a 32-bit immediate scalar or a 32-bit immediate 4-vector in the Shader code. At most one source operand per instruction may be specified using an immediate value (having up to 4 components). Immediate scalar values used in indexing of registers can only be used once per indexed operand in an instruction, but these immediate values do not count against the limit of one immediate as a raw source operand. e.g. "add r0, v[1 + r0.x], float4(1.0f,2.0f,3.0f,4.0f)" is valid, since there is only one immediate source operand present (the float4), with the value 1 in the indexing of v[] not counting against the limit.

    If a source operand is a Constant Buffer reference (see Constant Buffers below), the reference to a Constant Buffer DOES count against the same limit as immediate values. This allows implementations to provide immediate values through the same hardware path as Constant Buffers if desired. e.g. "add r0, cb0[r1.x], float4(1.0f,2.0f,3.0f,4.0f)" is invalid, since both an immediate value is used as well as a Constant Buffer read in the same instruction.

    There is no limit on the total number of times immediate constants can appear in a single instruction or in a shader.

    7.5 Constant Buffers

    There are 15 slots for ConstantBuffers that can be active per Pipeline stage. Indexing across ConstantBuffers is not permitted. A given ConstantBuffer is accessed as an operand to any Shader operation as if it is an indexable read-only register in the Shader. Unlike other Buffer binding locations in the pipeline, Constant Buffers do not allow Buffer offsets or custom strides. The stride of the Buffer is assumed to be the Element width of R32G32B32A32_TYPELESS, and the first Element in the Buffer (at Buffer offset zero) is assumed to be constant #[0] when referenced from the Shader.

    In Shader code, just as a t# register is a placeholder for a Texture, a cb# register is a placeholder for a ConstantBuffer at "slot" #. A ConstantBuffer is accessed in a Shader using: cb#[index] as an operand to Shader instructions, where 'index' can be either an r# or statically indexed x# containing a 32-bit unsigned integer, an immediate 32-bit unsigned integer constant, or a combination of the two, added together. e.g. "mov r0, cb3[x3[0].x+6]" represents moving Element 7 from the ConstantBuffer assigned to slot 3 into r0, assuming x3[0].x contains 1.

    There is no limit on the total number of times constant buffer reads (from any buffer and any location in the buffer) can appear in a single instruction or in a shader.

    The declaration of a ConstantBuffer (cb# register) in a Shader includes information such as the size the Shader expects the bound buffer to be and the buffer's access priority, as discussed in the paragraphs below.

    Out of bounds access to ConstantBuffers returns 0 in all components. Out of bounds behavior is always with respect to the size of the buffer bound at that slot.

    If the constant buffer bound to a slot is larger than the size declared in the shader for that slot, implementations are allowed to return incorrect data (not necessarily 0) for indices that are larger than the declared size but smaller than the buffer size.

    Fetching from a ConstantBuffer slot with no Buffer present always returns 0 in all components for all indices.

    With this set of information, different hardware implementations sporting varying degrees of optimization for ConstantBuffer access may make informed decisions about how to compile access to the ConstantBuffer into Shader code. Compiled shaders must never have to recompile just because different ConstantBuffers get bound to the Shader, as the necessary characteristics have been statically declared. Runtime validation (at least in debug) will ensure that the Shader code and the sizes of bound ConstantBuffers satisfy the declarations.

    The priorities assigned to ConstantBuffers assist hardware in best utilizing any dedicated constant data access paths/mechanisms, if present. There is no guarantee, however, that accesses to ConstantBuffers with higher priority will always be faster than lower priority ConstantBuffers. It is possible that a higher priority ConstantBuffer could produce slower performance than a lower priority ConstantBuffer, depending on the declared characteristics of the buffers involved. For example an implementation may have some arbitrary sized fast constant RAM not large enough for a couple of high priority ConstantBuffers that a Shader has declared, but large enough to fit a declared low priority ConstantBuffer. Such an implementation may have no choice but to use the standard (assumed slow) texture load path for large high priority ConstantBuffers (perhaps tweaking the cache behavior at least), while placing the lowest priority ConstantBuffer into the (assumed fast) constant RAM.

    Applications are able to write Shader code that reads constants in whatever pattern and quantity desired, while still allowing different hardware to easily achieve the best performance possible.

    7.5.1 Immediate Constant Buffer

    In addition to the aforementioned 15 slots for Constant Buffers, every shader program can declare(22.3.4) a single Immediate Constant Buffer with up to 4096 4-vector values. The data is tied to the shader program permanently, but otherwise behaves (gets accessed) by the shader exactly the same way as Constant Buffers.

    There is no limit on the total number of times immediate constant buffer reads (from any location in the buffer) can appear in a single instruction or in a shader.

    7.6 Shader Output Type Interpretation

    The application is given control over the data type interpretation for Shader outputs (i.e. writing raw integer values vs. writing normalized float values) by simply choosing an appropriate format to interpret the output resource's contents as. See the Formats(19.1) section for detail.

    7.7 Shader Input/Output

    Details on Shader input/output registers (indeed all registers) are provided in the sections dedicated to each Shader unit elsewhere in the spec.

    One thing in common about input/output registers for all shaders is that if they are declared(22.3.30) to be dynamically indexable from the shader, and the shader indexes them out of the declared range, results are undefined, although no data from outside the GPU process context is ever visible.


    7.8 Integer Instructions


    Section Contents

    (back to chapter)

    7.8.1 Overview
    7.8.2 Implementation Notes
    7.8.3 Bitwise Operations
    7.8.4 Integer Arithmetic Operations
    7.8.5 Integer/Float Conversion Operations
    7.8.6 Integer Addressing of Register Banks


    7.8.1 Overview

    There is a collection of instructions available to Shaders which are dedicated to performing integer arithmetic and bitwise operations. Operands and output registers for integer instructions can be any of the register classes available to the floating point instructions. There is no data type associated with registers; Shader instructions determine how the data stored in registers is interpreted. Integer instructions simply assume that the data being read from operands and written to the destination are all 32-bit values (unsigned or signed 2's complement, depending on the instruction).

    7.8.2 Implementation Notes

    Shader register storage is made up of 32-bit*4-component quantities, and integer arithmetic on these registers is required to be performed at full 32-bit precision in all cases.

    7.8.3 Bitwise Operations

    The bitwise instructions are listed in the Bitwise Instructions(22.11) sub-section of the full instruction listing.

    7.8.4 Integer Arithmetic Operations

    See the Integer Arithmetic Instructions(22.12) sub-section of the full instruction listing.

    7.8.5 Integer/Float Conversion Operations

    There is no implicit conversion between floating-point and integer values. Contents of registers are interpreted as floats or ints by the particular instruction being executed. Instructions that allow explicit conversions to be performed are listed in the Type Conversion Instructions(22.13) sub-section of the full instruction listing.

    7.8.6 Integer Addressing of Register Banks

    Integer offsets for reads from register banks are available. These offsets must be scalar values (i.e. a select swizzle must be used to select one component of any vector-valued register used as an index) and are considered to be unsigned 32 bit values.

    This indexing mechanism applied to indexable x# registers allows compilers to generate stack-like behavior for Shader subroutines.

    An example syntax for indexing is:

    mov r1, cb7[3+r2.x]

    This instruction assumes that an unsigned 32-bit integer value exists in r2.x, and uses that value to offset into ConstantBuffer 7, starting from location 3 in the ConstantBuffer. Thus, if r2.x contains integer value 2, entry 5 of ConstantBuffer 7 would be referenced.

    7.9 Floating Point Instructions

    Floating point instructions must follow the D3D11.3 Floating Point Rules(3.1).

    A listing of all floating point instructions can be found here(22.10).

    7.9.1 Float Rounding

    Instructions are provided for rounding floating point values to integral floating point values:

    round_ne(22.10.14) (nearest-even)
    round_ni(22.10.15) (negative-infinity)
    round_pi(22.10.16) (positive-infinity)
    round_z(22.10.17) (towards zero)

    7.10 Vector vs Scalar Instruction Set

    The D3D intermediate language (IL) and register model are 4-vec oriented. Since this does not constrain hardware implementation (vector vs scalar) too much, this convention will carry forward until a good reason to switch paradigms surfaces. It is known that many implementations actually happen to operate on scalars or combinations of layouts even now.

    One area where the vector assumption seems to materially impact data organization is the indexing of registers such as inputs or outputs – the indexing happens across registers. If it is important to be able to express cleanly how to index through an array of scalars, this could be an example of an argument for switching the IL to be completely scalar.


    7.11 Uniform Indexing of Resources and Samplers


    Section Contents

    (back to chapter)

    7.11.1 Overview
    7.11.2 Index Range
    7.11.3 Constant Buffer Indexing Example
    7.11.4 Resource/Buffer Indexing Example
    7.11.5 Sampler Indexing Example
    7.11.6 Resource Indexing Declarations


    7.11.1 Overview

    Shaders have bindpoint arrays for various classes of read-only input resources: Constant Buffers (cb), Texture/Buffers (t), Samplers (s).

    D3D11 allows all of these to be dynamically but uniformly indexed from a shader, whereas previously none of them were indexable.

    As with indexing of other types, such as indexable temps (x#), the dynamic index can be either an r# or statically indexed x# containing a 32-bit unsigned integer, an immediate 32-bit unsigned integer constant, or the combination of the two, added together.

    The constraint on the indexing of resources or samplers is that the index must be uniform. That is, the computed index must be the same at that point in the lockstep execution of the program for all invocations of the shader within the Draw*() call. If due to flow control, some of the lockstep shader invocations are inactive, the computed index in those shaders is ignored and therefore cannot cause a violation of the uniform indexing constraint on all the active invocations.

    The HLSL compiler will enforce this behavior and driver compilers must not break it either. Violations of the uniform indexing constraint would be a result of an HLSL compiler bug or a driver compiler bug only, and in such cases the indexing results are undefined.

    7.11.2 Index Range

    Out of bounds resource indexing produces the same result as if accessing a slot with no resource bound.

    In particular note that with Constant Buffers, there are 14 API-visible Constant Buffer slots (a couple of other slots are reserved for various purposes). The valid indexing range for Constant Buffers is therefore [0..13], and accesses out of that range behave as if accessing a slot with no Constant Buffer bound.

    Out of bounds indexing of the Samplers (s#) results in undefined behavior.

    7.11.3 Constant Buffer Indexing Example

    Suppose x3[0].x contains 4 and x4[2].y contains 5. The following mov instruction:

    mov r0, cb[x3[0].x+6][x4[2].y+9] 

    is therefore equivalent to:

    mov r0, cb[10][14]

    which means read a 32-bit * 4-vector from location [14] in the ConstantBuffer, at ConstantBuffer bind point [10] (0-based counting).

    The uniform dynamic indexing of which Constant Buffer to read from is what was not supported previously. Dynamic indexing within the Constant Buffer itself has always been supported.

    7.11.4 Resource/Buffer Indexing Example

    Suppose x3[0].x contains 4. The following ld instruction:

    ld r0, r1, t[x3[0].x+6], texture2D

    is equivalent to:

    ld r0, r1, t[10], texture2D

    Note the "texture2D" at the end is also a new requirement, whereby all ld/sample instructions will indicate which Shader Resource View type is to be sampled.

    7.11.5 Sampler Indexing Example

    Suppose x3[0].x contains 4 and x4[2].y contains 5. The following sample instruction:

    sample r0, r1, t[x3[0].x+6], s[x4[2].y+9], textureCubeArray

    is equivalent to:

    sample r0, r1, t[10], s[14], textureCubeArray

    7.11.6 Resource Indexing Declarations

    Shader declarations from Shader Model 4.x for individual resources, constant buffers and samplers remain the same in Shader Model 5.0. These are particularly informative for parts of shader code that reference these objects directly, just as before.

    However, all instructions that reference texture objects (t#) now specify the view dimension (e.g. textureCubeArray) as a literal parameter. This is redundant when indexing is not used, since the up-front declaration of each t# has a view dimension, but useful when indexing is used.

    7.12 Limitations on Flow Control and Subroutine Nesting

    A flow control block is defined as an if(22.7.1) block, loop(22.7.4) block, or switch(22.7.18) block. Flow control blocks can nest up to 64 deep per subroutine (and main). Behavior of flow control instructions beyond this nesting limit is undefined.

    Subroutines can nest up to 32 deep. If there are already 32 entries on the return address stack and a "call" is issued, the call is skipped over.

    7.13 Memory Addressing and Alignment Issues

    For Typed memory views, the number of components in an address when accessed by a shader instruction is determined by the number of components in the resource dimension. Each address component is an unsigned 32-bit integer element index.

    For Raw memory views, the address is a single component unsigned 32-bit integer byte offset from the beginning of the view. The addresses must be 32-bit aligned. If an unaligned address is specified for an operation involving a write, the entire contents of the UAV(5.3.9) being written, or all of Thread Group Shared Memory (in the Compute Shader(18)) - whichever is being accessed - becomes undefined. If an unaligned address is specified for an operation involving a read, an undefined result is returned to the shader. It is invalid for implementations to perform the access as if there were no 32-bit alignment constraints.

    For Structured memory views, the address is two unsigned 32-bit integer values. The first value is the struct index, and the second value is a byte offset into the struct. The byte offset must be aligned to 32-bits, otherwise the same behavior described for misaligned raw memory access above applies.
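
    The following HLSL sketch (resource names assumed, not part of the spec) shows these alignment rules from a Compute Shader:

    RWByteAddressBuffer      RawUAV  : register(u0);  // Raw view: byte offsets
    StructuredBuffer<float4> Structs : register(t0);  // Structured view

    [numthreads(64, 1, 1)]
    void main(uint3 tid : SV_DispatchThreadID)
    {
        // Raw view: a single 32-bit byte offset, which must be a
        // multiple of 4 (32-bit aligned).
        uint offset = tid.x * 4;
        uint value  = RawUAV.Load(offset);   // defined
        RawUAV.Store(offset, value + 1);     // defined
        // RawUAV.Load(offset + 2) would be misaligned: the returned value
        // is undefined.  A misaligned Store would make the entire contents
        // of the UAV undefined.

        // Structured view: (struct index, byte offset into the struct);
        // the offset to any float4 member is inherently 32-bit aligned.
        float4 s = Structs[tid.x];
    }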

    Each memory access instruction defines its behavior for out of bounds accesses, with distinctions for the memory location being accessed (UAV vs SRV vs Thread Group Shared Memory), and the layout (raw vs structured vs typed). See the documentation of individual instructions for details. The behaviors are similar for similar classes of instructions – e.g. all atomics have the same out of bounds behavior, all immediate atomics (which return a value to a shader) have their own consistent out of bounds access behavior, etc.


    7.14 Shader Memory Consistency Model


    Section Contents

    (back to chapter)

    7.14.1 Intro
    7.14.2 Atomicity
    7.14.3 Sync
    7.14.4 Global vs Group/Local Coherency on Non-Atomic UAV Reads


    7.14.1 Intro

    The types of memory accesses included in the scope of this chapter are: to Unordered Access Views(5.3.9) (UAVs, u#), available to the Compute Shader(18) and Pixel Shader(16), as well as Thread Group Shared Memory (g#), available to the Compute Shader.

    The D3D11 Shader Memory Consistency Model is weak/relaxed, as generally understood in existing architectures and literature. Loosely, this means the program author and/or compiler are responsible for identifying all memory and thread synchronization points via some appropriately expressive labeling.

    This section outlines how this weak/relaxed Memory Consistency Model appears to function from the point of view of D3D software.

    7.14.2 Atomicity

    An atomic operation may involve both reading from and then writing to a memory location. Atomic operations apply only to either u# (Unordered Access Views) or g# (Thread Group Shared Memory).

    It is guaranteed that when a thread issues an atomic operation on a memory address, no write to the same address from outside the current atomic operation by any thread can occur between the atomic read and write.

    If multiple atomic operations from different threads target the same address, the operations are serialized in an undefined order.

    Atomic operations do not imply a memory or thread fence. Fence operations (dubbed "sync") are introduced below. If the program author/compiler does not make appropriate use of fences, it is not guaranteed that all threads see the result of any given memory operation at the same time, or in any particular order with respect to updates to other memory addresses.

    Atomicity is implemented at 32-bit granularity. If a load or store operation spans more than 32-bits, the individual 32-bit operations are atomic, but not the whole.

    Limitation: Atomic operations on Thread Group Shared Memory are atomic with respect to other atomic operations, as well as operations that only perform reads ("load"s). However atomic operations on Thread Group Shared Memory are NOT atomic with respect to operations that perform only writes ("store"s) to memory. Mixing of atomics and stores on the same Thread Group Shared Memory address without thread synchronization and memory fencing between them produces undefined results at the address involved. This limitation arises because some implementations of loads and stores do not honor the locking semantics for implementing atomics. It turns out this has no impact on loads, since they are guaranteed to retrieve a value either before or after an atomic (they will not retrieve partially updated values, given they are all defined at 32-bit quanta). However store operations could find their way into the middle of an atomic operation and thus have their effect possibly lost.

    Note that there is no such limitation on atomics to UAV memory; atomic operations on UAV memory are atomic with respect to other atomic operations as well as loads and stores.
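
    The following Compute Shader sketch (buffer names assumed, not part of the spec) illustrates these rules in HLSL:

    groupshared uint    g_Bins[256];            // g#: Thread Group Shared Memory
    RWByteAddressBuffer Totals : register(u0);  // u#: UAV

    [numthreads(256, 1, 1)]
    void main(uint gi : SV_GroupIndex)
    {
        // Atomics on g# are atomic with respect to other atomics and loads;
        // contending atomics are serialized in an undefined order.
        InterlockedAdd(g_Bins[gi & 255], 1);

        // NOT safe: a plain store to a g# address that other threads are
        // updating atomically is undefined without thread synchronization
        // and memory fencing in between.
        // g_Bins[gi] = 0;

        // Atomics on UAV memory are atomic with respect to atomics, loads
        // AND stores.  This is an "immediate" atomic: it returns the value
        // at the destination before the add.
        uint oldValue;
        Totals.InterlockedAdd(0, 1, oldValue);
    }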

    7.14.3 Sync

    A sync(22.17.7) instruction is included in the Shader IL for Pixel Shader and the Compute Shader.

    This provides memory fence semantics at various scopes, and optional thread group synchronization semantics (the latter only applies to the Compute Shader). For details, including some discussion of the implications see the description of the sync(22.17.7) instruction.
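
    For example, the following Compute Shader sketch (names assumed) uses the HLSL barrier intrinsic that maps to sync_g_t in the IL (a Thread Group Shared Memory fence combined with thread group synchronization):

    groupshared float         g_Shared[64];
    RWStructuredBuffer<float> Output : register(u0);

    [numthreads(64, 1, 1)]
    void main(uint gi : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID)
    {
        g_Shared[gi] = (float)dtid.x;        // every thread writes one slot

        // Fence g# writes and synchronize all threads in the group.
        GroupMemoryBarrierWithGroupSync();

        // Safe: this read is ordered after every thread's write above.
        Output[dtid.x] = g_Shared[63 - gi];
    }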

    7.14.4 Global vs Group/Local Coherency on Non-Atomic UAV Reads

    Typical implementations will have a cache hierarchy to improve read access performance on UAV(5.3.9) accesses. A constraint that some implementations have with the first stage in this cache hierarchy is that, in addition to operating at per-thread-group scope only, the cache does not have an efficient way of being synchronized with writes or atomics performed by other thread groups. Such behavior only surfaces as an issue for applications when cross-thread-group communication needs to be performed involving data loads. In this case, the hardware needs to know that it must bypass the first stage of caches on loads, reaching out to more global memory so that the cross-thread-group communication can function. D3D allows applications to specify this cross-thread-group communication intent as follows.

    If a Compute Shader(18) thread in a given thread group needs to perform loads of data that was written by atomics or stores in another thread group, the UAV slot where the data resides must be tagged upon declaration in the shader as "globally coherent", so the implementation can ignore the local cache. Otherwise, this form of cross-thread group data sharing will produce undefined results.
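
    In HLSL this tag is the globallycoherent storage class on the UAV declaration, which emits the corresponding flag on the u# declaration in the IL. A minimal declaration sketch (names assumed):

    // Required when loads must observe stores/atomics made by other
    // thread groups:
    globallycoherent RWByteAddressBuffer CrossGroupUAV : register(u0);

    // Without the tag, the UAV is only "group coherent": loads are only
    // guaranteed to see stores/atomics from the same thread group.
    RWByteAddressBuffer PerGroupUAV : register(u1);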

    Atomic read-modify-write operations do not have this constraint (even though a part of the operation is a read/load), because a byproduct of the hardware honoring atomicity is that the entire system sees the operation, whereas simple loads on some implementations may only go to a local cache that has no knowledge of external updates.

    If a UAV is not declared as "globally coherent", it is only "group coherent", which means loads can only see data written by stores and atomics in other threads in the same thread group. The affected hardware knows it can make use of its thread-group specific caching for loads, since writes to the memory only came from the current thread group. A UAV tagged as "globally coherent" is also inherently "group coherent", although the affected hardware would not use its local cache. As such, the "globally coherent" flag should only be specified when necessary.

    As a reminder though, to guarantee coherency on UAV accesses on all implementations, not only must shaders make the global vs group scope distinction discussed here upon UAV declaration, but they must also make appropriate use of memory and/or thread barriers ("sync_*" in the IL) as needed within the shader to enforce proper ordering of operations by individual threads as seen by others. In addition, the "sync" operation has options for memory barriers that also distinguish between global vs group scope, but that control is separate from the topic of this section, and may not be exposed until a later time, as discussed in the sync instruction definition.

    Back to the issue of global vs group coherency on non-atomic UAV reads. Importantly, for many scenarios where cross-thread-group communication or reduction (such as histograms) can be accomplished using only atomic operations (no cross-thread-group loads involved), there is no problem, since atomic operations are implemented by all hardware in a globally coherent way, regardless of whether the UAV has been tagged as "globally coherent" or not.

    In the Pixel Shader(16), if a UAV is not declared as "globally coherent", it is only "locally coherent". "Local coherency" is the Pixel Shader’s equivalent of the Compute Shader’s "group coherency", except having scope limited only to a single Pixel Shader invocation. This indicates that the Pixel Shader is not doing any cross-PS-invocation communication involving simple load operations. Note, however, that in the Pixel Shader just like in the Compute Shader, atomic read-modify-write operations are always globally coherent. Indeed it is likely to be rare for a Pixel Shader or perhaps even the Compute Shader to need to declare a UAV as "globally coherent", given that atomic operations, which are always globally coherent, might provide the most practical mechanism for cross-PS-invocation or cross-group operations.


    7.15 Shader-Internal Cycle Counter (Debug Only)

    7.15.1 Basic Semantics

    To assist comparisons of algorithms running on GPUs during application development, a cycle counter can be read from within shaders. The cycle counter is a 64-bit unsigned integer.

    The cycle counter appears as an additional 2*32-bit (64 bit total) input register type that can be declared in any version 5.0+ shader. There are currently no native 64-bit integer arithmetic operations in shaders, although it is simple enough to emulate them. It may be fine for shaders to just look at the low 32-bits of the counter – this can be requested in the shader. Applications may also export the measurements using standard shader outputs for later analysis such as on the CPU.

    The counter is an implementation-dependent measure of cycles in the GPU engine, requiring care to interpret it usefully.

    7.15.2 Interpreting Cycle Counts

    For this discussion, consider a shader "invocation" to be a single execution of one shader program from beginning to end. For the Compute Shader however, an "invocation" is a single thread-group’s execution – e.g. the lifespan of the contents of thread-group shared memory.

    The initial value of the counter is undefined.

    A single reading of the cycle counter is meaningless. But any shader invocation can poll the counter value any number of times.

    Computing a delta from cycle counter readings within a shader invocation is meaningful.

    Computing a delta from cycle counter readings across separate shader invocations is not meaningful on all hardware. Developers must obtain information directly from IHVs about whether this is meaningful.

    The only IHV agnostic approach to interpreting the counters is to limit calculation of deltas to within a given shader invocation, and only make comparisons of deltas within or between shader invocations.

    There are plenty of reasons why test runs will execute differently. The obvious one is that execution of a shader can be interrupted by thread switching, so delta measurements will be arbitrarily larger than the number of cycles spent executing instructions in a given thread.

    There is no supported way to find out the frequency of the counter. There is no way to correlate this shader internal counter with external timers such as asynchronous time queries. The counter measurements cannot be correlated with measurements on different hardware by other hardware vendors or even necessarily the same vendor.

    If a GPU’s speed changes, such as for power saving, there is no way to know this happened, or its effect on cycle measurements.

    Beyond these hints about the care needed to interpret the counter, the onus is on developers to research the properties of new hardware designs that may affect measurements.

    7.15.3 Shader Compiler Constraints

    The HLSL shader compiler and driver compilers must treat reads of the cycle counter as barriers. Instructions can’t be moved across a counter read, and counter reads can’t be merged.

    7.15.4 Feature Availability

    The runtime enforces that shaders using this feature can only be created on a system with the debug layer enabled. The debug layer is not allowed to be redistributed to end-user machines. The point is that shaders that use this counter are not intended to be shipped.

    7.15.5 Conformance

    This feature will not be tested on hardware by WHQL, except perhaps simply checking that drivers do not crash. Microsoft will test that the HLSL compiler output is correct.

    7.15.6 Shader Bytecode Details

    A new input register, vCycleCounter(22.3.29), can be declared in any version 5_0 (and beyond) shader:

    dcl_input vCycleCounter.{x|xy}.  

    Reading x yields the 32 LSBs of the 64-bit count, and reading y yields the 32 MSBs.

    This register can only be used as the source to a mov instruction, e.g. mov r0.w, vCycleCounter.x.


    7.16 Textures and Resource Loading

    Up to 128 Resources (e.g. Buffer, Texture1D/2D/3D/Cube) can be active per Pipeline stage. A Resource binding is a representation of a Resource's base pointer (and other data such as size and pixel layout) and is independent of the samplers.

    Selecting a texture out of the set of bound textures via Shader indexing is subject to the uniform indexing constraint described in Uniform Indexing of Resources and Samplers(7.11). Separately, Texture1D/2D/3D resources with an Array dimension > 1, or TextureCube (which has an Array dimension of 6), allow per-invocation indexing along the array axis from within Shader code.
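
    For example, in HLSL (names illustrative), the array axis of a Texture2DArray is addressed via the third component of the texture coordinate passed to a sample:

    Texture2DArray Slices : register(t0);
    SamplerState   Samp   : register(s0);

    float4 main(float2 uv    : TEXCOORD0,
                float  slice : SLICE) : SV_Target
    {
        // uv addresses within a slice; slice selects along the array axis.
        return Slices.Sample(Samp, float3(uv, slice));
    }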

    Textures can only have a single Element format. Likewise, Buffers used as input to Shaders can also only have a single Element format, and have an implied data stride equal to the Element size. A single Buffer (or Texture) could be set to multiple input slots simultaneously, with different Element formats and/or offsets; however, because Buffers bound as Shader inputs have their data stride implied by the Element format, it is not possible to describe "Array-of-Structures" style layouts in Buffers bound at Shader input. This is unlike the Input Assembler Stage, where multiple-element Buffers are permitted, and Element offsets and strides can be defined freely.

    Data from textures is accessed in shaders via the load (ld) and sample instructions. The ld instruction provides a simple read and (optional) float32 conversion of texture data using integral addresses, while the sample instructions use normalized floating point addressing and perform filtering in addition to the format conversion.

    7.17 Texture Load

    The load operation performs a non-filtered read of resource data. See the ld(22.4.6) instruction definition for details.

    7.17.1 Multisample Resource Load

    Multisample resources can be set as shader inputs, which allows individual samples to be read by the shader. Support for multisample shader reads has the following restrictions:

    See ld(22.4.6) and dcl_resource(22.3.12) definitions for details.


    7.18 Texture Sampling


    Section Contents

    (back to chapter)

    7.18.1 Overview
    7.18.2 Samplers
    7.18.3 Sampler State
    7.18.4 Normalized-Space Texture Coordinate Magnitude vs. Maximum Texture Size
    7.18.5 Processing Normalized Texture Coordinates
    7.18.6 Reducing Texture Coordinate Range
    7.18.7 Point Sample Addressing
    7.18.8 Linear Sample Addressing
    7.18.9 Texture Address Processing

    7.18.9.1 Border Color
    7.18.10 Mipmap Selection
    7.18.11 LOD Calculations
    7.18.12 TextureCube Edge and Corner Handling
    7.18.13 Anisotropic Filtering of TextureCubes
    7.18.14 Sample Return Value Type Interpretation
    7.18.15 Comparison Filtering
    7.18.15.1 Shadow Buffer Exposure on Feature Level 9.x
    7.18.15.1.1 Mapping the Shadow Buffer Scenario to the D3D9 DDI
    7.18.15.1.2 Checking for Shadow Support on Feature Level 9.x
    7.18.16 Texture Sampling Precision
    7.18.16.1 Texture Addressing and LOD Precision
    7.18.16.2 Texture Filtering Arithmetic Precision
    7.18.16.3 General Texture Sampling Invariants
    7.18.17 Sampling Unbound Data


    7.18.1 Overview

    This section describes the mechanics of sampling Texture1D/2D/3D/Cube resources using filtering. The simplest form of sampling a texture is point sampling, supported for all data formats, however more complex filtering operations are only available to some formats, indicated in the format list in the Formats(19.1) section.

    The behaviors described here are obtained via the various sample* instructions, such as sample(22.4.15). See the specs for those instructions for further details that complement this section.

    Unless otherwise noted, all texture sampling address operations are performed according to the arithmetic processing rules described in the Basics(3) section.

    Texture filtering theory or historical background is NOT provided in this spec.

    Note that details of all required texture filtering algorithms are not fully/exactly specified for this version of D3D11.3; the specs below only explicitly define a subset of all filtering features available in D3D11.3.

    7.18.2 Samplers

    Samplers identify filtering modes and other sampler state, described below. Samplers are indexable from within shaders only under the uniform indexing constraint(7.11). There are 16 sampler "slots" per Pipeline stage, to which "Sampler Objects" can be arbitrarily assigned/reassigned.

    The state for a sampler is encapsulated in a "sampler object", up to 4096 of which can be created through the API. At the time a sampler object is created, all of its state must be chosen permanently, and can never be changed. These sampler objects can be arbitrarily assigned to any of the 16 "sampler slots" at each of the Shader stages (a single sampler object is allowed to be assigned to multiple sampler slots, even on multiple pipeline stages simultaneously, if desired).

    The reason Sampler Objects are statically created, and there is a limit on the number that can be created, is to enable hardware to maintain references to multiple samplers in flight in the Pipeline, without having to track changes or flush the Pipeline, which would be necessary if Sampler Objects were allowed to be edited.

    7.18.3 Sampler State

    typedef enum D3D11_FILTER
    {
        // Bits used in defining enumeration of valid filters:
        // bits [1:0] - mip: 0 == point, 1 == linear, 2,3 unused
        // bits [3:2] - mag: 0 == point, 1 == linear, 2,3 unused
        // bits [5:4] - min: 0 == point, 1 == linear, 2,3 unused
        // bit  [6]   - aniso
        // bit  [7]   - comparison
        // bits [8:7] - reduction type:
        //                0 == standard filtering
        //                1 == comparison
        //                2 == min
        //                3 == max
        // bit  [31]  - mono 1-bit (narrow-purpose filter) [no longer supported in D3D11]
    
        D3D11_FILTER_MIN_MAG_MIP_POINT                              = 0x00000000,
        D3D11_FILTER_MIN_MAG_POINT_MIP_LINEAR                       = 0x00000001,
        D3D11_FILTER_MIN_POINT_MAG_LINEAR_MIP_POINT                 = 0x00000004,
        D3D11_FILTER_MIN_POINT_MAG_MIP_LINEAR                       = 0x00000005,
        D3D11_FILTER_MIN_LINEAR_MAG_MIP_POINT                       = 0x00000010,
        D3D11_FILTER_MIN_LINEAR_MAG_POINT_MIP_LINEAR                = 0x00000011,
        D3D11_FILTER_MIN_MAG_LINEAR_MIP_POINT                       = 0x00000014,
        D3D11_FILTER_MIN_MAG_MIP_LINEAR                             = 0x00000015,
        D3D11_FILTER_ANISOTROPIC                                    = 0x00000055,
        D3D11_FILTER_COMPARISON_MIN_MAG_MIP_POINT                   = 0x00000080,
        D3D11_FILTER_COMPARISON_MIN_MAG_POINT_MIP_LINEAR            = 0x00000081,
        D3D11_FILTER_COMPARISON_MIN_POINT_MAG_LINEAR_MIP_POINT      = 0x00000084,
        D3D11_FILTER_COMPARISON_MIN_POINT_MAG_MIP_LINEAR            = 0x00000085,
        D3D11_FILTER_COMPARISON_MIN_LINEAR_MAG_MIP_POINT            = 0x00000090,
        D3D11_FILTER_COMPARISON_MIN_LINEAR_MAG_POINT_MIP_LINEAR     = 0x00000091,
        D3D11_FILTER_COMPARISON_MIN_MAG_LINEAR_MIP_POINT            = 0x00000094,
        D3D11_FILTER_COMPARISON_MIN_MAG_MIP_LINEAR                  = 0x00000095,
        D3D11_FILTER_COMPARISON_ANISOTROPIC                         = 0x000000d5,
        D3D11_FILTER_MINIMUM_MIN_MAG_MIP_POINT                      = 0x00000100,
        D3D11_FILTER_MINIMUM_MIN_MAG_POINT_MIP_LINEAR               = 0x00000101,
        D3D11_FILTER_MINIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT         = 0x00000104,
        D3D11_FILTER_MINIMUM_MIN_POINT_MAG_MIP_LINEAR               = 0x00000105,
        D3D11_FILTER_MINIMUM_MIN_LINEAR_MAG_MIP_POINT               = 0x00000110,
        D3D11_FILTER_MINIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR        = 0x00000111,
        D3D11_FILTER_MINIMUM_MIN_MAG_LINEAR_MIP_POINT               = 0x00000114,
        D3D11_FILTER_MINIMUM_MIN_MAG_MIP_LINEAR                     = 0x00000115,
        D3D11_FILTER_MINIMUM_ANISOTROPIC                            = 0x00000155,
        D3D11_FILTER_MAXIMUM_MIN_MAG_MIP_POINT                      = 0x00000180,
        D3D11_FILTER_MAXIMUM_MIN_MAG_POINT_MIP_LINEAR               = 0x00000181,
        D3D11_FILTER_MAXIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT         = 0x00000184,
        D3D11_FILTER_MAXIMUM_MIN_POINT_MAG_MIP_LINEAR               = 0x00000185,
        D3D11_FILTER_MAXIMUM_MIN_LINEAR_MAG_MIP_POINT               = 0x00000190,
        D3D11_FILTER_MAXIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR        = 0x00000191,
        D3D11_FILTER_MAXIMUM_MIN_MAG_LINEAR_MIP_POINT               = 0x00000194,
        D3D11_FILTER_MAXIMUM_MIN_MAG_MIP_LINEAR                     = 0x00000195,
        D3D11_FILTER_MAXIMUM_ANISOTROPIC                            = 0x000001d5
    } D3D11_FILTER;
    
    typedef enum D3D11_TEXTURE_ADDRESS_MODE
    {
        D3D11_TEXADDRESS_WRAP         = 1,
        D3D11_TEXADDRESS_MIRROR       = 2,
        D3D11_TEXADDRESS_CLAMP        = 3,
        D3D11_TEXADDRESS_BORDER       = 4,
        D3D11_TEXADDRESS_MIRRORONCE   = 5
    } D3D11_TEXTURE_ADDRESS_MODE;
    
    typedef struct D3D11_SAMPLER_STATE
    {
        D3D11_FILTER                  Filter;
        D3D11_TEXTURE_ADDRESS_MODE    AddressU; // U coordinate address mode
        D3D11_TEXTURE_ADDRESS_MODE    AddressV; // V coordinate address mode
        D3D11_TEXTURE_ADDRESS_MODE    AddressW; // W coordinate address mode
        float                         MinLOD;
        float                         MaxLOD;
        float                         MipLODBias; // (-16.0f..15.99f)
        DWORD                         MaxAnisotropy;  // (0 - 16)
        D3D11_COMPARISON_FUNC         ComparisonFunction; // for Percentage-Closer filter
        float                         BorderColor[4]; // R,G,B,A
    } D3D11_SAMPLER_STATE;
    
    

    See the Sampler Declaration Statement(22.3.34) in the shader instruction reference for a description of which sampler states are honored depending on the choice of Filter setting, and a description of which sample* instructions in the shader are permitted to reference samplers configured in various ways.

    7.18.4 Normalized-Space Texture Coordinate Magnitude vs. Maximum Texture Size

    The magnitude of normalized-space texture coordinates (allowing for texture tiling) has no effect on the maximum supportable texture dimensions that can be sampled. The only catch is that as the absolute magnitude of a normalized-space texture coordinate gets larger (e.g. large amounts of tiling), floating point dictates that less precision will be available to resolve individual texels in a given tiling of the texture being sampled. Large amounts of tiling of large dimension textures will yield sampling artifacts where float32 precision becomes inadequate. But separate from this tradeoff, in order to decouple the magnitude of normalized-space texture coordinates from having any effect on the maximum texture dimension that can be sampled given float32 normalized-space addressing, a range reduction to about [-10...10], depending on the scenario, is applied on the texture coordinates.

    Details of this range reduction are described later(7.18.6). The reduction happens before scaling texture coordinates by texture size, conversion to fixed point, and final application of Texture Address modes (CLAMP/MIRROR/WRAP etc.) on texel addresses. The range reduction allows the fixed point representation to not have to dedicate storage for the texture tiling. It is important to note that range reduction is a separate step from applying Texture Address mode (although the particular Texture Address mode affects what type of reduction gets used).

    Using range reduction to decouple texture coordinate magnitude from supportable texture size has the following implication: The maximum texture dimension possible to be sampled in D3D11.3 is 2^17. This limit is derived starting with 24 bits of float32 fractional precision for the original texture coordinate, subtracting required subtexel precision (8 bits), and subtracting 1 more bit due to the factor of 2 scaling in the reduced range. Of course, the minimum upper limit for filterable texture dimension required to be exposed by all D3D11.3 implementations is far smaller, at only 16384 (see System Limits(21)).

    7.18.5 Processing Normalized Texture Coordinates

    This section describes in general how to convert a normalized texture coordinate to a texture address. The description is based on sampling a Texture1D, but applies equally to Texture2D and Texture3D (and not TextureCubes).

    A normalized texture coordinate (U) maps the range [0, 1] to the range [0, numTexelsU], where numTexelsU is the size of a 1D texture in texels. The process of computing a texture address is as follows:
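
    The following pseudocode is a hedged reconstruction of that sequence from the surrounding subsections (7.18.6 through 7.18.9); ReduceRange is defined below in 7.18.6, and ApplyAddressMode is a hypothetical stand-in for the Texture Address Processing(7.18.9) pseudocode:

    int ComputeTexelAddress(float U, int numTexelsU,
                            D3D11_TEXTURE_ADDRESS_MODE AddressMode)
    {
        // (1) Reduce the normalized coordinate range (7.18.6).
        float reducedU = ReduceRange(U, AddressMode);

        // (2) Scale by the texture size, then convert to fixed point with
        //     at least 8 bits of subtexel fraction (7.18.16.1).
        float scaledU = reducedU * numTexelsU;

        // (3) Select texel(s) via the point or linear sample kernel
        //     (7.18.7/7.18.8); the point case is shown.
        int texelU = (int)floor(scaledU);

        // (4) Map out-of-range texel addresses back into the texture, or
        //     select the Border Color (7.18.9).
        return ApplyAddressMode(texelU, numTexelsU, AddressMode);
    }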

    7.18.6 Reducing Texture Coordinate Range

    To limit the number of bits needed to store the texture coordinate in fixed point after conversion from floating point, the range of the normalized texture coordinate is reduced to be within [-10,10], depending on the Address mode. This removes the magnitude of texture tiling from the texture coordinate, while not affecting the behavior of texture address wrap modes. The same address mode handling can be applied to the range reduced texture coordinate as the original, producing the same result. The benefit is that the magnitude of texture tiling is not stored in the coordinate at the same time that texture size scaling is performed on the coordinate. This enables far larger texture coordinate range to be handled cleanly than would otherwise be possible without reduction.

    Note that the range reductions applied here in some cases leave a bit of extra padding (up to the [-10,10] mentioned). This padding allows for the fact that after scaling by texture size, the selection of texels for point or linear sample kernels involves picking texel(s) to the left and/or right of the sample location, so coordinates that are not near the boundaries of the addressing mode must not appear as if they are on the boundary. e.g. Consider Linear sampling a coordinate that straddles a border when in BORDER mode: this needs to pick up the Border Color for 1/2 of the samples and the interior edge of the texture for the other 1/2. However range reduction cannot just clamp to [0..1) for BORDER mode, because it would make coordinates that fall completely into BORDER territory incorrectly behave as if they straddle the border (picking up some contribution of Border Color and interior). Range reduction has to also allow for immediate texel offsets permitted in shader code. Range reduction does not change expected texture sampling behavior; it just helps keep the sequence of floating point operations on texture coordinates within manageable range.

    The following logic describes how normalized texture coordinate range reduction is performed. (This is different from the final Texture Address Processing(7.18.9), which happens a couple of steps later, on scaled coordinates that identify texels.)

    Given:
    float signedFrac(float f) returns (f - round_z(f)) // round_z : "round towards zero"
    float frac(float f) returns (f - round_ni(f))      // round_ni : "round towards negative infinity"
    
    We have:
    
    float ReduceRange(float U, D3D11_TEXTURE_ADDRESS_MODE AddressMode)
    {
        switch (AddressMode)
        {
        case D3D11_TEXADDRESS_WRAP:
            // The reduced range is [0, 1)
            return frac(U);
        case D3D11_TEXADDRESS_MIRROR:
            // The reduced range is (-2, 2)
            return signedFrac(U/2) * 2;
        case D3D11_TEXADDRESS_MIRRORONCE:
        case D3D11_TEXADDRESS_CLAMP:
        case D3D11_TEXADDRESS_BORDER:
            // The reduced range is [-10, 10].
            // Each of these modes might use different tightnesses of reduced range,
            // but since there really is no benefit in that, a one-size-fits-all
            // approach is taken here.
            // Note that the range leaves room for immediate texel-space offsets
            // supported by sample instructions, [-8...7],
            // preventing these offsets from causing texcoords that clearly should
            // be out of range (i.e. in border/clamp region) from falling within
            // range after range reduction.  The point is that range reduction does
            // not have an effect on the texels that are supposed to be chosen.
            if(U <= -10)
                return -10;
            else if(U >= 10)
                return 10;
            else return U;
        }
        return 0;
    }
    

    Note that the amount of padding supported here for mirroronce/clamp/border is only sufficient for point or linear filtering of a texture (a larger kernel becomes more likely to expose the reduced range boundary), including with immediate texel offsets from the shader. Furthermore, complex filters which use point or linear filter taps as building blocks (the key example being Anisotropic Texture Filtering) are perfectly compatible with the specified range reduction. The reason is that such filters choose their "taps" by perturbing normalized texture coordinates (e.g. walking the line of anisotropy in Anisotropic Texture Filtering), and thus each perturbed "tap" individually goes through the range reduction described here before application of the usual Point/Linear Sample Addressing logic and Texture Address Processing described below.

    7.18.7 Point Sample Addressing

    Setting aside how sampler state is configured and how mipmap LOD is chosen, consider simply the task of point sampling an Element from a particular miplevel of a Texture1D, given a scalar floating point texture coordinate in normalized space. In the Texture Coordinate Interpretation(3.3.3) section, there is a diagram illustrating generally how a 1D texture coordinate maps to a texel (not accounting for wrapping). Note from the "Texture Coordinate System" diagram shown that texel corners have integral coordinates in texel-space, and so texel centers are at half-units away from the corners. Point sampling selects the "nearest" texel based on the proximity of texel centers to the texture coordinate (keeping in mind that texel centers are at half-units):
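
    A minimal sketch of this selection rule, where scaledU is the range-reduced coordinate already scaled by the texture size:

    int PointSample1D(float scaledU)
    {
        // Texel i spans [i, i+1) in texel space with its center at i + 0.5,
        // so the texel whose center is nearest to scaledU is floor(scaledU).
        return (int)floor(scaledU);
    }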

    For Texture2D and Texture3D Resources, the same rules apply independently on the other dimensions.

    For TextureCube Resources, the following occurs:

    7.18.8 Linear Sample Addressing

    Similar to the previous section, set aside how sampler state is configured and how mipmap LOD is chosen for now, and consider simply the task of linear sampling an Element from a particular miplevel of a Texture1D, given a scalar floating point texture coordinate in normalized space. Linear sampling in 1D selects the nearest two texels to the sample location and weights the texels based on the proximity of the sample location to them.
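
    A minimal sketch of the 1D texel selection and weighting, with scaledU as in the point sampling sketch above; the chosen texel addresses still pass through Texture Address Processing(7.18.9):

    void LinearSample1D(float scaledU,
                        out int texel0, out int texel1, out float weight1)
    {
        // Shift by the half-texel center offset, then split into the two
        // nearest texels and a fractional blend weight.
        float t = scaledU - 0.5f;
        texel0  = (int)floor(t);  // left neighbor
        texel1  = texel0 + 1;     // right neighbor
        weight1 = frac(t);        // contribution of texel1
        // result = (1 - weight1)*texel0Value + weight1*texel1Value
    }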

    The procedure described above applies to linear sampling of a given miplevel of a Texture2D as well, extended to fetching the nearest 4 texels (2x2) and weighting them bilinearly.

    Performing linear sampling of a miplevel of a Texture3D Resource extends the concepts described above to fetching of 8 texels.

    In the case of a TextureCube, see the section regarding TextureCube Edge and Corner Handling(7.18.12)

    7.18.9 Texture Address Processing

    The sample* instructions provide texture coordinates in normalized floating point form, such that values in [0..1] range span a given dimension of a texture, and values outside this range fall off the borders of the texture. Later in the filtering process, when individual texels are fetched, if the address is outside the extents of the texture, either the address gets mapped back into range by the texture address mode for each component, or the border-color is used. The texture address mode is defined by the AddressU, AddressV, and AddressW members of D3D11_SAMPLER_STATE.

    Consider the moment in the process of sampling of a Texture1D just after picking a particular integer address scaledU to fetch a texel from (details on choosing sample locations described elsewhere for various filter modes). Suppose the texel address scaledU falls off the Texture1D, meaning either (scaledU < 0), or (scaledU > numTexelsU - 1), where numTexelsU is the count of texels in the U dimension of the Texture1D. The following pseudocode describes how the setting on D3D11_SAMPLER_STATE member AddressU gets applied on scaledU:

    if ((scaledU < 0) || (scaledU > numTexelsU-1))
    {
        switch (AddressU)
        {
        case D3D11_TEXADDRESS_WRAP:
            scaledU = scaledU % numTexelsU;
            if(scaledU < 0)
                scaledU += numTexelsU;
            break;
        case D3D11_TEXADDRESS_MIRROR:
            {
                if(scaledU < 0)
                    scaledU = -scaledU - 1;
                bool Flip = (scaledU/numTexelsU) & 1;
                scaledU %= numTexelsU;
                if( Flip ) // Odd tile
                    scaledU = numTexelsU - scaledU - 1;
                break;
            }
        case D3D11_TEXADDRESS_CLAMP:
            scaledU = max( 0, min( scaledU, numTexelsU - 1 ) );
            break;
        case D3D11_TEXADDRESS_MIRRORONCE:
            if(scaledU < 0)
                scaledU = -scaledU - 1;
            scaledU = max( 0, min( scaledU, numTexelsU - 1 ) );
            break;
        case D3D11_TEXADDRESS_BORDER:
            // Special case: Instead of fetching from the texture,
            // use the Border Color(7.18.9.1).
            bUseBorderColor = true;
            break;
        default:
            scaledU = 0;
        }
    }
    

    For Texture2D and Texture3D, all of the above modes apply to the V and W dimensions independently, based on AddressV and AddressW. If any single dimension selects Border Color, then the Border Color(7.18.9.1) is applied.

    7.18.9.1 Border Color

    Border Color values are defined in the DDI via 4 floating point values (RGBA), in linear space. The Border Color used in filtering is snapped to the precision the hardware performs filtering at for the format.

    Note that the only components of the BorderColor used by filtering hardware are the ones present in the resource format description.

    For example, suppose the resource format is DXGI_FORMAT_R8_SNORM, and BorderColor is needed during a sample operation. In this case only the RED component of BorderColor is used, along with the appropriate format-specific defaults for the other components. The BorderColor (the red part in this case) is taken as floating-point data and clamped into the range of the format before filtering. In this case, the red part of the BorderColor is clamped to [-1.0f,1.0f] range before being used by the filtering hardware. From this point (entering the filtering hardware) onward, the fact that BorderColor is being used has no more behavioral effect.

    7.18.10 Mipmap Selection

    Suppose the task at hand is to choose a mipmap level from a Resource, given a floating point LOD value. The choice of mipmap level is based on the mip filter mode chosen in the Sampler State(7.18.3), for which the possible choices are POINT and LINEAR. Anisotropic texture filtering uses LINEAR mipmap selection.

    7.18.11 LOD Calculations

    This section describes how LOD is computed as part of sample* instructions involving filtering.
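
    As a hedged illustration only (the normative arithmetic is defined by the individual sample* instruction specs), the commonly used isotropic form of the calculation looks like the following, using the screen-space derivatives of the texture coordinates:

    float ComputeIsotropicLOD(float2 uv, float2 texSize,
                              float MipLODBias, float MinLOD, float MaxLOD)
    {
        // Derivatives of the normalized coordinates, scaled to texel space
        // (ddx/ddy are Pixel Shader derivative intrinsics).
        float2 dx = ddx(uv) * texSize;
        float2 dy = ddy(uv) * texSize;

        // LOD is log2 of the longer footprint axis.
        float lod = 0.5f * log2(max(dot(dx, dx), dot(dy, dy)));

        // Apply the sampler's bias and clamp to its LOD range.
        return clamp(lod + MipLODBias, MinLOD, MaxLOD);
    }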

    7.18.12 TextureCube Edge and Corner Handling

    TextureCube filtering near Cube edges, where 2x2 (bilinear) filter taps would fall off a face, is required to spill over by one texel row/column to the appropriate adjacent map.

    At TextureCube corners, a linear combination of the three relevant samples is required. The ideal (reference) linear combination of the three samples in the corner case is as follows: Imagine flattening out the Cube faces at the corner, yielding 3 texels and a missing one. Apply bilinear weights on this virtual grid of 4 texels, and then divide the weight for the missing texel evenly amongst the 3 other texels. It is alternatively permissible for an implementation to, instead of dividing the weight evenly amongst the 3 other texels, just split the weight of the missing texel across the 2 adjacent texels. However in future versions of D3D, only the reference behavior will be permitted.

    7.18.13 Anisotropic Filtering of TextureCubes

    Anisotropic texture filtering on a TextureCube does not have specified/required behavior except that it must at least behave no "worse" than tri-linear filtering would.

    7.18.14 Sample Return Value Type Interpretation

    The application is given control over the return type of texture load instructions (i.e. reading raw integer values vs. reading normalized float values) by simply choosing an appropriate format to interpret the resource's contents as. See the Formats(19.1) section for detail.

    7.18.15 Comparison Filtering

    For details on comparison filtering, see the sample_c(22.4.19) and sample_c_lz(22.4.20) instructions.

    Comparison Filtering is an attempt by D3D11.3 to define a basic building-block filtering operation that is useful for Percentage Closer Depth Filtering.

    7.18.15.1 Shadow Buffer Exposure on Feature Level 9.x

    D3D9 never officially specified dedicated hardware support for shadow map scenarios. Namely, D3D9 does not spec the ability to bind a depth buffer as a shader input and to sample from it using comparison filtering (also known as "Percentage Closer Filtering"). Even though this never made it into the D3D9 spec, the D3D9 runtime intentionally used loose validation to enable IHVs to align on a convention for how to make the feature work.

    In the meantime, the D3D10+ hardware spec added a requirement for supporting binding depth as a texture and for comparison filtering.

    As more scenarios arise involving the D3D11+ APIs running on Feature Level 9.x, it finally makes sense to expose the D3D9 shadow buffer support. It turns out this is possible simply by loosening validation on existing API constructs in the D3D11.1+ API for depth buffers and comparison filtering, mapping to the equivalent D3D9 convention IHVs had aligned on, where applicable.

    When Feature Level 9.x is used at the D3D11.1+ API (meaning the D3D9 DDI is used) on a Win8+ driver, regardless of hardware feature level, applications can do the following:

    The overbearing validation described above (dropping Draw calls when state is invalid) helps ensure that an application that can get shadows working at Feature Level 9.x will behave the same if the Feature Level is bumped up to 10+ with no code change required.

    The reason this feature is limited to Win8+ drivers (regardless of hardware feature level) is to avoid having to test on any old D3D9 hardware that is unlikely to be driven by the D3D11.1 APIs in the first place.

    7.18.15.1.1 Mapping the Shadow Buffer Scenario to the D3D9 DDI

    The D3D11.1 runtime maps this shadow scenario to the D3D9 DDI (regardless of hardware feature level) as follows.

    This feature was added too late to enforce via hardware conformance kit testing. However all hardware vendors at the time of shipping agreed to support it, and tests are being authored to assist with basic verification (even if not enforced for now).

    7.18.15.1.2 Checking for Shadow Support on Feature Level 9.x

    The D3D11 CheckFeatureSupport() API has a new capability that can be checked: D3D11_FEATURE_D3D9_SHADOW_SUPPORT. This is set to true if the driver is Win8+ (no need to ask the driver anything else).

    On the other hand if the D3D11 CheckFeatureSupport() / CheckFormatSupport() APIs are used to query format support on the individual DXGI_FORMAT_* names described here, the runtime will NOT report support for any capabilities specific to the shadow buffer scenario. For example support for using DXGI_FORMAT_R16_UNORM as a texture is not reported on Feature Level 9.1/9.2 (though it is supported on 9.3, independent of the shadow scenario).

    Not reporting shadow support on format caps queries was a simplification. It avoids conflicts where this depth scenario allows operations with format names that are not allowed in non-shadow cases, particularly for DXGI_FORMAT_R16_UNORM. It was not worth disambiguating the format caps reporting for this unique case. The bottom line is all an application needs to do is check the D3D11_FEATURE_D3D9_SHADOW_SUPPORT cap described above to know if the entire scenario will work.

    7.18.16 Texture Sampling Precision

    7.18.16.1 Texture Addressing and LOD Precision

    During Texture Sampling(7.18), the amount of range required for selecting texels (after scaling normalized texture coordinates by texture size) is at least 2^16. This range is centered around 0.

    The amount of subtexel precision required (after scaling texture coordinates by texture size) is at least 8-bits of fractional precision (2^8 subdivisions).

    In mipmap selection, after conversion from float, at least 8-bits must represent the integer component of the LOD, and at least 8-bits must represent the fractional component of an LOD (2^8 subdivisions).

    See the discussion in the Fixed Point Integers(3.2.4) section on how fixed point numbers should be defined and how it relates to texture coordinate precision.

    7.18.16.2 Texture Filtering Arithmetic Precision

    All of the texture filtering operations in D3D11.3, when being performed on floating point formats (regardless of format width), are required to follow the D3D11.3 Floating Point Rules(3.1), with one exception: when a filter weight of 0.0 is encountered, NaNs or signed zeros may or may not be propagated from the source texture.

    Texture filtering operations performed on fixed point formats must be done with at least as much precision as the format.

    7.18.16.3 General Texture Sampling Invariants

    Here are some general observations about things that can be expected of texture filtering operations.

    7.18.17 Sampling Unbound Data

    Sampling from a slot with no texture bound returns 0 in all components.


    7.19 Subroutines / Interfaces


    Section Contents

    (back to chapter)

    7.19.1 Overview
    7.19.2 Differences from 'Real' Subroutines
    7.19.3 Subroutines: Non-goals
    7.19.4 Subroutines - Instruction Reference
    7.19.5 Simple Example

    7.19.5.1 HLSL - Simple Example
    7.19.5.2 IL - Simple Example
    7.19.5.3 API - Simple Example
    7.19.6 Runtime API for Interfaces
    7.19.6.1 Overview
    7.19.6.2 Prototype of changes
    7.19.7 Complex Example
    7.19.7.1 HLSL - Complex Example
    7.19.7.2 IL - Complex Example
    7.19.7.3 API - Complex Example


    7.19.1 Overview

    The programmable graphics pipeline has given software developers greatly enhanced flexibility and power. As a result, shader programming has evolved to the point where programmers need to combine multiple code building blocks (i.e. subroutines) on the fly. Current approaches generally cause the static creation of thousands of one-off shaders, each using a particular combination of subroutines to realize a specific effect. The use of flow control and looping can reduce the number of these precompiled combinations, but these techniques have a dramatic effect on the runtime performance of the shader code, and applications are still sensitive to the extra instructions and registers used in common shaders. Furthermore, since the shader programs are "kernels" or inner loops, any extra overhead for trying to reuse the same instruction stream to represent multiple combinations is more noticeable than in more traditional CPU code. The application developer has no way of knowing when it is safe, in regards to performance, to use flow control to mitigate code complexity. This leads to a different performance problem: dealing with thousands of shaders.

    The goal of this feature is to allow applications to have a simple, expressive programming model that abstracts away this combinatoric complexity while still achieving the performance of the custom precompiled shaders. To achieve this goal, we move the complexity from the application level to the driver level where hardware-specific knowledge can be utilized to reduce program size and complexity.

    To satisfy the performance requirements of inner loop code, the overhead of calling conventions and lost optimizations needs to be addressed. Our method avoids the overhead by using a subroutine model that virtually "inlines" the functions that can be called. This is done by compiling code normally up to a call site, and then compiling all possible callees with the current state of the caller. The functions called would then be optimized for the current register state by mapping inputs and outputs to their current register locations. While this approach increases overall program size, it avoids the cost of both parameter passing and stack save/restore, thereby avoiding the overhead of traditional function calls while preserving runtime flexibility.

    The IL ASM has code blocks that act and look like subroutines; there are defined in/out parameters, and registers are all local (in/out/temp/scratch). Some global references remain: textures, constant buffers, and samplers. The main difference from normal subroutines is that each location that can call a subroutine has a declaration describing the call destinations that are possible.

    The set of functions to call when executing a given shader program can be changed between draw calls when calling SetShader. When binding the shader program to the pipeline, the list of functions to use is specified. Selecting the set of functions to use between draw calls allows the driver to recalculate the hardware requirements for a specified set of functions. Calculating the true number of registers required for a given "specialization" of a shader provides the combined flexibility of choice at runtime and the performance of a specialized shader.

    7.19.2 Differences from 'Real' Subroutines

    The primary difference of this approach from "real" subroutines is that at runtime no calling convention is used. Each time a function could be called, a version of the function is emitted to match the caller’s register and other state. Since a new version of the callee is emitted for each location in the caller code that the function is called from, all optimizations used when inlining apply, except that callee code must remain functionally separate from caller code.

    Take an example: The main function has an fcall(22.7.19) instruction and that fcall instruction has two function implementations that could be called. When generating the microcode for the program to execute, the code is generated up to the fcall instruction and the current state of the registers and other shader state is stored off in "StateBeforeCall". Then code is generated for the first function that can be called starting with the current state of register allocation, scratch registers, etc. Next the current state is restored to StateBeforeCall and the code for the second function is generated. Finally the current state is restored to StateBeforeCall again and the impacts of the outputs of the fcall are applied to the current state, and code generation continues after the fcall.

    Limitations are present in the IL that allow for the calling destination to have a version of a function’s microcode emitted using the current register knowledge of the caller to allocate the callee’s local registers after the caller’s registers so that no saving/restoring of data is required when crossing the function boundary.

    The downside from "real" subroutines is that the amount of code to represent the program can become quite large. No code sharing is done between multiple call sites. If code is larger than the code cache, and the miss latency is not hidden by some other mechanism, then "real" subroutines are very useful. Assuming that the code bloat size is minimal (i.e. each function is only ever called from one location), then performance will be better with the new method – no parameter passing overhead, inlining optimizations, etc.

    Another problem with the new method is that all destinations must be known at compile time. Due to validation that is currently done, all call destinations will need to be known. As that requirement is relaxed, "real" subroutines become a better way of handling late-binding destinations.

    HLSL requires that all texture and sampler parameters be rooted in some well-known global object so that the compiler can determine which texture or sampler index to use for a particular texture or sampler variable throughout the entire program. As fcalls constitute a late-binding boundary, the compiler cannot easily track parameter identity, and thus texture and sampler arguments to fcalls are not allowed. Note that when only concrete classes are used this isn’t a problem. Additionally, texture and sampler members of classes should be allowed; this limitation only applies to parameters to interface methods that are used with full fcall dispatch.

    Also see the related topics Uniform Indexing of Resources and Samplers(7.11) as well as the this[](22.7.20) register.

    7.19.3 Subroutines: Non-goals

    7.19.4 Subroutines - Instruction Reference

    7.19.5 Simple Example

    7.19.5.1 HLSL - Simple Example

        interface Light
        {
            float3 Calculate(float3 Position, float3 Normal);
        };
    
        class AmbientLight : Light
        {
            float3 Calculate(float3 Position, float3 Normal)
            {
                return AmbientValue;
            }
    
            float3 AmbientValue;
        };
    
        class DirectionalLight : Light
        {
            float3 Calculate(float3 Position, float3 Normal)
            {
                float3 LightDir = normalize(Position - LightPosition);
                float LightContrib = saturate( dot( Normal, -LightDir) );
                return LightColor * LightContrib;
            }
    
            float3 LightPosition;
            float3 LightColor;
        };
    
        AmbientLight MyAmbient;
        DirectionalLight MyDirectional;
    
        float4 main (Light MyInstance, float3 CurPos: CurPosition,
                     float3 Normal : Normal) : SV_Target
        {
            float4 Ret;
            Ret.xyz = MyInstance.Calculate(CurPos, Normal);
            Ret.w = 1.0;
    
            return Ret;
        }
    

    7.19.5.2 IL - Simple Example

        // Function table for AmbientLight.
        dcl_function_body fb0
        dcl_function_table ft0 = { fb0 }
    
        // Function table for DirectionalLight.
        dcl_function_body fb1
        dcl_function_table ft1 = { fb1 }
    
        // main's MyMaterial parameter.
        dcl_interface fp0[1][1] = { ft0, ft1 };
    
        // main shader code
    
        // call AmbientLight or DirectionalLight based on function pointer bound
        fcall fp0[0][0]
        mov o0.xyz, r0.xyzx
        mov o0.w, l(1.000000)
        ret
    
        // AmbientLight::Calculate
        label fb0
        mov r0.w, this[0].y
        mov r1.x, this[0].x
        mov r0.xyz, cb[r1.x + 0][r0.w + 0].xyzx
        ret
    
        // DirectionalLight::Calculate
        label fb1
        mov r0.w, this[0].y
        mov r1.xyz, this[0].xyxx
        add r1.yzw, v0.xxyz, -cb[r1.z + 0][r1.y + 0].xxyz
        dp3 r2.x, r1.yzwy, r1.yzwy
        rsq r2.x, r2.x
        mul r1.yzw, r1.yyzw, r2.xxxx
        dp3_sat r1.y, v1.xyzx, -r1.yzwy
        mul r1.xyz, r1.yyyy, cb[r1.x + 0][r0.w + 1].xyzx
        mov r0.xyz, r1.xyzx
        ret
    

    7.19.5.3 API - Simple Example

        //create the shader
        //    and specify the class library to load class instance info into
        pDevice->CreatePixelShader(pShaderCode, pMyClassLinkage, &pMyPS);
    
        //get a handle to the MyDirectional and MyAmbient class instances
        //    from the class library
        //the zero is an array index for when the variable is an array.
        pMyClassLinkage->
            GetClassInstance(L"MyDirectional", 0, &pMyDirectionalLight);
        pMyClassLinkage->
            GetClassInstance(L"MyAmbient", 0, &pMyAmbientLight);
    
        while (true)
        {
            // select either the MyDirectional or MyAmbient class instance
            if (DirectionalLighting)
                pDevice->PSSetShader(pMyPS, &pMyDirectionalLight, 1);
            else
                pDevice->PSSetShader(pMyPS, &pMyAmbientLight, 1);
    
            RenderScene();
        }
    

    7.19.6 Runtime API for Interfaces

    7.19.6.1 Overview

    The programming model for subroutines is an interface driven model. The interface provides the definition of the function tables that can be switched between efficiently. A level of data abstraction is also present to allow for swapping of both data and function pointers during SetShader calls. At SetShader time, an array of class instantiations is specified that correspond to the interfaces that are used by the shader. The shader reflection system specifies information for each entry in the required interface array. A runtime reflection API is required to be able to specify the class instance in a way that can be efficiently mapped by the runtime to function pointers for the driver calls to consume. The runtime API does not need to be complex, just a method of providing handles to class instances.

    The runtime API has only one goal: Provide a handle to SetShader that can be efficiently used to specify to the driver what functions should be executed for a given shader bind. To achieve this goal, a collection of class information is required if the class instance handles are to be shared across multiple shaders, i.e. between all shaders within an effect. When a shader is created, a new ID3D11ClassLinkage parameter specifies where to add the class metadata. If the same class linkage is specified for two shaders, then the same class instance handles are used when binding either shader. The collection of class metadata could be global to a given device, but that could become cumbersome when mixing large collections of shaders (i.e. keeping one middleware solution separate from another middleware solution).

    7.19.6.2 Prototype of changes

        interface ID3D11ClassLinkage : IUnknown
        {
        // PRIMARY FUNCTION - get a reference to an instance of a class
        //    that exists in a shader.  The common scenario is to refer to
        //    variables declared in shaders, which means that a reference is
        //    acquired with this function and then passed in on SetShader
            HRESULT GetClassInstance(
                WCHAR *pszClassInstanceName,
                UINT uInstanceIndex,
                ID3D11ClassInstance **pClassInstance);
    
        //  Create a class instance reference that is the combination of a class
        //    type and the location of the data to use for the class instance
        //      - not the common scenario, but useful in case the data location
        //        for a class is dynamic or not known until runtime
            HRESULT CreateClassInstance(
                WCHAR *pszClassTypeName,
                UINT ConstantBufferOffset,
                UINT ConstantVectorOffset,
                UINT TextureOffset,
                UINT SamplerOffset,
                ID3D11ClassInstance **pClassInstance);
        }
    
        //  Specifying the calls in "10 speak".  Use the following as an example
        //    of how one could retrofit D3D10 and then put that into the D3D11 API
        //    i.e. ignoring the split of Creates off of the device, new stages, etc.
        interface ID3D11Device
        {
            [ … Existing calls … ]
    
        //  Shader create calls take a parameter specifying the class library
        //     to append the class symbol information from the shader into.
        //     This is a NON-OPTIONAL parameter.  A shader is unusable without
        //     the function table information being used (assuming it has any).
    
            HRESULT CreateVertexShader(
                void *pShaderBytecode,
                SIZE_T BytecodeLength,
                ID3D11ClassLinkage *pClassLinkage,
                ID3D11VertexShader **ppVertexShader);
    
            HRESULT CreateGeometryShader(
                void *pShaderBytecode,
                SIZE_T BytecodeLength,
                ID3D11ClassLinkage *pClassLinkage,
                ID3D11GeometryShader **ppGeometryShader);
    
            HRESULT CreatePixelShader(
                void *pShaderBytecode,
                SIZE_T BytecodeLength,
                ID3D11ClassLinkage *pClassLinkage,
                ID3D11PixelShader **ppPixelShader);
    
        // Not shown: Similar to above for Hull Shader, Domain Shader and Compute Shader
    
            HRESULT CreateClassLinkage(
                ID3D11ClassLinkage **ppClassLinkage);
    
        //  Shader bind calls take an extra array to specify the function tables
        //      to use until the next bind shader call
    
            void VSSetShader(
                ID3D11VertexShader *pShader,
                ID3D11ClassInstance **ppClassInstances,
                UINT NumInstances);
    
            void GSSetShader(
                ID3D11GeometryShader *pShader,
                ID3D11ClassInstance **ppClassInstances,
                UINT NumInstances);
    
            void PSSetShader(
                ID3D11PixelShader *pShader,
                ID3D11ClassInstance **ppClassInstances,
                UINT NumInstances);
    
            // Not shown: Similar to above for Hull Shader, Domain Shader and Compute Shader
    
        }
    

    7.19.7 Complex Example

    7.19.7.1 HLSL - Complex Example

        interface Light
        {
            float3 Calculate(float3 Position, float3 Normal);
        };
    
        class AmbientLight : Light
        {
            float3 m_AmbientValue;
    
            float3 Calculate(float3 Position, float3 Normal)
            {
                return m_AmbientValue;
            }
        };
    
        class DirectionalLight : Light
        {
            float3 m_LightDir;
            float3 m_LightColor;
    
            float3 Calculate(float3 Position, float3 Normal)
            {
                float LightContrib = saturate( dot( Normal, -m_LightDir) );
                return m_LightColor * LightContrib;
            }
        };
    
        uint g_NumLights;
        uint g_LightsInUse[4];
        Light g_Lights[9];
    
        float3 AccumulateLighting(float3 Position, float3 Normal)
        {
            float3 Color = 0;
    
            for (uint i = 0; i < g_NumLights; i++)
            {
                Color += g_Lights[g_LightsInUse[i]].Calculate(Position, Normal);
            }
    
            return Color;
        }
    
        interface Material
        {
            void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord);
            float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord);
        };
    
        class FlatMaterial : Material
        {
            float3 m_Color;
    
            void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord)
            {
            }
            float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord)
            {
                return m_Color * AccumulateLighting(Position, Normal);
            }
        };
    
        class TexturedMaterial : Material
        {
            float3 m_Color;
            Texture2D<float3> m_Tex;
            sampler m_Sampler;
    
            void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord)
            {
            }
            float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord)
            {
                float3 Color = m_Color;
    
                Color *= m_Tex.Sample(m_Sampler, TexCoord) * 0.1234;
    
                Color *= AccumulateLighting(Position, Normal);
    
                return Color;
            }
        };
    
        class StrangeMaterial : Material
        {
            void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord)
            {
                Position += Normal * 0.1;
            }
            float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord)
    
            {
                return AccumulateLighting(Position, Normal);
            }
        };
    
        float TestValueFromLight(Light Obj, float3 Position, float3 Normal)
        {
            float3 Calc = Obj.Calculate(Position, Normal);
            return saturate(Calc.x + Calc.y + Calc.z);
        }
    
        AmbientLight g_Ambient0;
        DirectionalLight g_DirLight0;
        DirectionalLight g_DirLight1;
        DirectionalLight g_DirLight2;
        DirectionalLight g_DirLight3;
        DirectionalLight g_DirLight4;
        DirectionalLight g_DirLight5;
        DirectionalLight g_DirLight6;
        DirectionalLight g_DirLight7;
    
        FlatMaterial g_FlatMat0;
        TexturedMaterial g_TexMat0;
        StrangeMaterial g_StrangeMat0;
    
        float4 main (
            Material MyMaterial,
            float3 CurPos: CurPosition,
            float3 Normal : Normal,
            float2 TexCoord : TexCoord0) : SV_Target
        {
            float4 Ret;
    
            if (TestValueFromLight(g_DirLight0, CurPos, Normal) > 0.5)
            {
                MyMaterial.Perturb(CurPos, Normal, TexCoord);
            }
    
            Ret.xyz = MyMaterial.CalculateLitColor(CurPos, Normal, TexCoord);
            Ret.w = 1;
    
            return Ret;
        }
    

    7.19.7.2 IL - Complex Example

        //
        // 'this' pointers are four-element vectors with indices for
        // which constant buffer holds the instance data (.x element),
        // the base offset of the instance data in the instance constant
        // buffer (.y), the base texture index (.z) and the base sampler
        // index (.w).  Basic instance members will therefore be
        // referenced with cb[r0.x][r0.y + member_offset].
        // 'this' pointers can be in arrays, so the first [] index
        // can also have a register to indicate array access.
        //
    
        //
        // For this example assume that globals are put in cbuffers
        // in the following order.  Entries are offset:size in
        // register (four-component) units.
        //
        // cb0:
        //     0:1 - g_NumLights.
        //     1:4 - g_LightsInUse.
        //     5:1 - g_Ambient0.
        //     6:2 - g_DirLight0.
        //     8:2 - g_DirLight1.
        //    10:2 - g_DirLight2.
        //    12:2 - g_DirLight3.
        //    14:2 - g_DirLight4.
        //    16:2 - g_DirLight5.
        //    18:2 - g_DirLight6.
        //    20:2 - g_DirLight7.
        //    22:1 - g_FlatMat0.
        //    23:1 - g_TexMat0.
        //
        // g_StrangeMat0 takes no space.
        //
        // interfaces:
        //     0:1 - MyMaterial.
        //     1:9 - g_Lights.
        //
        // textures:
        //     0:1 - g_TexMat0.
        //
        // samplers:
        //     0:1 - g_TexMat0.
        //
        // The this pointers for the concrete objects would then be:
        // g_Ambient0:    { 0,  5, -, - }
        // g_DirLight0:   { 0,  6, -, - }
        // g_DirLight1:   { 0,  8, -, - }
        // g_DirLight2:   { 0, 10, -, - }
        // g_DirLight3:   { 0, 12, -, - }
        // g_DirLight4:   { 0, 14, -, - }
        // g_DirLight5:   { 0, 16, -, - }
        // g_DirLight6:   { 0, 18, -, - }
        // g_DirLight7:   { 0, 20, -, - }
        // g_FlatMat0:    { 0, 22, -, - }
        // g_TexMat0:     { 0, 23, 0, 0 }
        // g_StrangeMat0: { -,  -, -, - }
        //
    
        //
        // Function bodies are declared explicitly so
        // that it’s known in advance which bodies exist
        // and how many bodies there are overall.
        //
    
        dcl_function_body fb0
        dcl_function_body fb1
        dcl_function_body fb2
        dcl_function_body fb3
        dcl_function_body fb4
        dcl_function_body fb5
        dcl_function_body fb6
        dcl_function_body fb7
        dcl_function_body fb8
        dcl_function_body fb9
        dcl_function_body fb10
        dcl_function_body fb11
    
        //
        // Function tables work similarly to vtables for C++ except
        // that a table has an entry per call site for an interface
        // instead of per method.
        //
    
        // Function table for AmbientLight.
        // One call site in AccumulateLighting multiplied by three calls of
        // AccumulateLighting from CalculateLitColor.
        dcl_function_table ft0 { fb3, fb6, fb9 }
    
        // Function table for DirectionalLight.
        // One call site in AccumulateLighting multiplied by three calls of
        // AccumulateLighting from CalculateLitColor.
        dcl_function_table ft1 { fb4, fb7, fb10 }
    
        // Function table for FlatMaterial.
        // One call to Perturb in main and one call to CalculateLitColor in main.
        dcl_function_table ft2 { fb0, fb5 }
    
        // Function table for TexturedMaterial.
        // One call to Perturb in main and one call to CalculateLitColor in main.
        dcl_function_table ft3 { fb1, fb8 }
    
        // Function table for StrangeMaterial.
        // One call to Perturb in main and one call to CalculateLitColor in main.
        dcl_function_table ft4 { fb2, fb11 }
    
        //
        // Function table pointers.  Each of these needs to be bound before
        // the shader is usable.  The idea is that binding gives
        // a reference to one of the function tables above so that
        // the method slots can be filled in.
        // The compiler will not generate pointers for unreferenced objects.
        //
        // A function table pointer has a full set of method slots to
        // avoid the extra level of indirection that a C++ pointer-to-
        // pointer-to-vtable representation would require (that would also
        // require that this pointers be 5-tuples).  In the HLSL virtual
        // inlining model it's always known what global variable/input is
        // used for a call so we can set up tables per root object.
        //
        // Function pointer decls indicate which function tables are
        // legal to use with them.  This also allows derivation of
        // method correlation information.
        //
        // The first [] of an interface decl is the array size.
        // If dynamic indexing is used the decl will indicate
        // that, as shown below.  An array of interface pointers can
        // be indexed statically also, it isn’t required that
        // arrays of interface pointers mean dynamic indexing.
        //
        // Numbering of interface pointers takes array size into
        // account, so the first pointer after a four entry
        // array fp6[4][1] would be fp10.
        //
        // The second [] of an interface decl is the number
        // of call sites, which must match the number of bodies in
        // each table referenced in the decl.
        //
    
        // main's MyMaterial parameter.
        dcl_interface fp0[1][2] = { ft2, ft3, ft4 };
    
        // g_Lights entries.
        dcl_interface_dynamicindexed fp1[9][3] = { ft0, ft1 };
    
        // main routine.
    
        // TestValueFromLight is a regular routine and is inlined.
        // The Calculate reference inside of it is passed the concrete
        // instance DirLight0 so it is devirtualized and inlined.
        dp3_sat r0.x, v1.xyzx, -cb0[6].xyzx
        mul r0.yz, r0.xxxx, cb0[7].xxyx
        add r0.y, r0.z, r0.y
        mad_sat r0.x, cb0[7].z, r0.x, r0.y
    
        // The return of TestValueFromLight is tested.
        lt r0.x, l(0.500000), r0.x
        if_nz r0.x
    
          // The call to Perturb is a full fcall
          fcall fp0[0][0]
          mov r2.xyz, r0.xyzx
          mov r0.x, r0.w
          mov r0.y, r1.x
    
        else
    
          mov r2.xyz, v1.xyzx
          mov r0.xy, v2.xyxx
    
        endif
    
        // The call to CalculateLitColor is a full fcall.
        fcall fp0[0][1]
    
        mov o0.xyz, r1.xyzx
        mov o0.w, l(1.000000)
        ret
    
        //
        // Function bodies.
        //
    
        // FlatMaterial version of main's call to Perturb.
        label fb0
        mov r0.xyz, v1.xyzx
        mov r0.w, v2.y
        mov r1.x, v2.x
        ret
    
        // TexturedMaterial version of main's call to Perturb.
        label fb1
        mov r0.xyz, v1.xyzx
        mov r0.w, v2.x
        mov r1.x, v2.y
        ret
    
        // StrangeMaterial version of main's call to Perturb.
        // NOTE: Position is not used later so the compiler has killed
        // the update to Position from this body.
        label fb2
        mov r0.xyz, v1.xyzx
        mov r0.w, v2.x
        mov r1.x, v2.y
        ret
    
        // AmbientLight version of FlatMaterial.CalculateLitColor-calls-
        // AccumulateLighting's call to Calculate.
        // NOTE: the Calculate bodies all look superficially
        // identical but all are different.  In one case
        // the array index is r1 and the return value is r4,
        // in one case the array index is r1 and the return value
        // is r5 and in the last case the array index is in r0
        // and the return is in r5.  Bodies are not interchangeable.
        label fb3
        // Array index is r1, return is r4.
        mov r2.w, this[r1.w + 1].y
        mov r1.w, this[r1.w + 1].x
        mov r4.xyz, cb[r1.w + 0][r2.w + 0].xyzx
        ret
    
        // DirectionalLight version of FlatMaterial.CalculateLitColor-calls-
        // AccumulateLighting's call to Calculate.
        label fb4
        // Array index is r1, return is r4.
        mov r2.w, this[r1.w + 1].y
        mov r3.w, this[r1.w + 1].x
        mov r4.w, this[r1.w + 1].y
        mov r5.x, this[r1.w + 1].x
        dp3_sat r4.w, r2.xyzx, -cb[r5.x + 0][r4.w + 0].xyzx
        mul r5.xyz, r4.wwww, cb[r3.w + 0][r2.w + 1].xyzx
        mov r4.xyz, r5.xyzx
        ret
    
        // FlatMaterial version of main's call to CalculateLitColor.
        label fb5
    
        // AccumulateLighting is inlined.
        mov r3.xyz, l(0,0,0,0)
        mov r0.w, l(0)
    
        loop
          // g_NumLights is cb0[0].
          uge r1.w, r0.w, cb0[0].x
          breakc_nz r1.w
    
          // Get g_Lights[g_LightsInUse[i]].
          // g_LightsInUse is cb0[1-4].
          // g_Lights is cb0[5-13].
          mov r1.w, cb0[r0.w + 1].x
    
          // Call Calculate.  Array index is r1.
          fcall fp1[r1.w + 0][0]
    
          // Return is expected in r4.
          mov r0.xyz, r4.xyzx
          add r3.xyz, r3.xyzx, r0.xyzx
          iadd r0.w, r0.w, l(1)
        endloop
    
        // Multiply times color.
        mov r0.xy, this[0].yxyy
        mul r0.xyz, r3.xyzx, cb[r0.y + 0][r0.x + 0].xyzx
        mov r1.xyz, r0.xyzx
        ret
    
        // AmbientLight version of TexturedMaterial.CalculateLitColor-calls-
        // AccumulateLighting's call to Calculate.
        label fb6
        // Array index is r1, return is r5.
        mov r2.w, this[r1.w + 1].y
        mov r1.w, this[r1.w + 1].x
        mov r5.xyz, cb[r1.w + 0][r2.w + 0].xyzx
        ret
    
        // DirectionalLight version of TexturedMaterial.CalculateLitColor-calls-
        // AccumulateLighting's call to Calculate.
        label fb7
        // Array index is r1, return is r5.
        mov r2.w, this[r1.w + 1].y
        mov r3.w, this[r1.w + 1].x
        mov r4.w, this[r1.w + 1].y
        mov r5.w, this[r1.w + 1].x
        dp3_sat r4.w, r2.xyzx, -cb[r5.w + 0][r4.w + 0].xyzx
        mul r6.xyz, r4.wwww, cb[r3.w + 0][r2.w + 1].xyzx
        mov r5.xyz, r6.xyzx
        ret
    
        // TexturedMaterial version of main's call to CalculateLitColor.
        label fb8
    
        // Texture sample.
        mov r4.xy, this[0].zw
        sample r0.xyz, v2.xy, t[r4.x].xyz, s[r4.y]
        mul r0.xyz, r0.xyzx, l(0.123400, 0.123400, 0.123400, 0.000000)
    
        // m_Color multiplied by texture sample.
        mov r0.w, this[0].y
        mov r1.w, this[0].x
        mul r0.xyz, r0.xyzx, cb[r1.w + 0][r0.w + 0].xyzx
    
        // AccumulateLighting is inlined.
        mov r4.xyz, l(0,0,0,0)
        mov r0.w, l(0)
        loop
          // g_NumLights is cb0[0].
          uge r1.w, r0.w, cb0[0].x
          breakc_nz r1.w
    
          // Get g_Lights[g_LightsInUse[i]].
          // g_LightsInUse is cb0[1-4].
          // g_Lights is cb0[5-13].
          mov r1.w, cb0[r0.w + 1].x
    
          // Call Calculate.  Array index is in r1.
          fcall fp1[r1.w + 0][1]
    
          // Return is expected in r5.
          mov r3.xyz, r5.xyzx
          add r4.xyz, r4.xyzx, r3.xyzx
          iadd r0.w, r0.w, l(1)
        endloop
    
        // Multiply accumulated color times texture color.
        mul r0.xyz, r0.xyzx, r4.xyzx
        mov r1.xyz, r0.xyzx
        ret
    
        // AmbientLight version of StrangeMaterial.CalculateLitColor-calls-
        // AccumulateLighting's call to Calculate.
        label fb9
        // Array index is r0, return is r5.
        mov r1.w, this[r0.w + 1].y
        mov r0.w, this[r0.w + 1].x
        mov r5.xyz, cb[r0.w + 0][r1.w + 0].xyzx
        ret
    
        // DirectionalLight version of StrangeMaterial.CalculateLitColor-calls-
        // AccumulateLighting's call to Calculate.
        label fb10
        // Array index is r0, return is r5.
        mov r1.w, this[r0.w + 1].y
        mov r2.w, this[r0.w + 1].x
        mov r3.w, this[r0.w + 1].y
        mov r4.w, this[r0.w + 1].x
        dp3_sat r3.w, r2.xyzx, -cb[r4.w + 0][r3.w + 0].xyzx
        mul r6.xyz, r3.wwww, cb[r2.w + 0][r1.w + 1].xyzx
        mov r5.xyz, r6.xyzx
        ret
    
        // StrangeMaterial version of main's call to CalculateLitColor.
        label fb11
    
        // AccumulateLighting is inlined.
        mov r4.xyz, l(0,0,0,0)
        mov r0.z, l(0)
    
        loop
          // g_NumLights is cb0[0].x.
          uge r0.w, r0.z, cb0[0].x
          breakc_nz r0.w
    
          // Get g_Lights[g_LightsInUse[i]].
          // g_LightsInUse is cb0[1-4].
          // g_Lights is cb0[5-13].
          mov r0.w, cb0[r0.z + 1].x
    
          // Call Calculate.  Array index is in r0.
          fcall fp1[r0.w + 0][2]
    
          // Return is in r5.
          mov r3.xyz, r5.xyzx
          add r4.xyz, r4.xyzx, r3.xyzx
          iadd r0.z, r0.z, l(1)
        endloop
        mov r1.xyz, r4.xyzx
        ret
    

    7.19.7.3 API - Complex Example

        // create a class library to hold class instance data
        pDevice->CreateClassLinkage(&pMyClassTable);
    
        // create the shader and supply a class library to add class instance data
        pDevice->
            CreatePixelShader(pMyCompiledPixelShader, pMyClassTable, &pMyPS);
    
        // use reflection to get where data should be stored in interface array
        NumInterfaces = pMyPSReflection->GetNumInterfaces();
        pMyLightsVar = pMyPSReflection->GetVariableByName("g_Lights");
        iLightOffset = pMyLightsVar->GetInterfaceSlot(0);
        pMyMaterialVar = pMyPSReflection->GetVariableByName("$MyMaterial");
        iMatOffset = pMyMaterialVar->GetInterfaceSlot(0);
    
        // Use class library to get references to all class instances
        //   needed in the shader.
        pMyClassTable->GetClassInstance("g_Ambient0", 0, &pAmbient0);
        pMyClassTable->GetClassInstance("g_DirLight0", &pDirLight[0]);
        pMyClassTable->GetClassInstance("g_DirLight1", &pDirLight[1]);
        pMyClassTable->GetClassInstance("g_DirLight2", &pDirLight[2]);
        pMyClassTable->GetClassInstance("g_DirLight3", &pDirLight[3]);
        pMyClassTable->GetClassInstance("g_DirLight4", &pDirLight[4]);
        pMyClassTable->GetClassInstance("g_DirLight5", &pDirLight[5]);
        pMyClassTable->GetClassInstance("g_DirLight6", &pDirLight[6]);
        pMyClassTable->GetClassInstance("g_DirLight7", &pDirLight[7]);
        pMyClassTable->GetClassInstance("g_FlatMat0", &pFlatMat0);
        pMyClassTable->GetClassInstance("g_TexMat0", &pTexMat0);
        pMyClassTable->GetClassInstance("g_StrangeMat0", &pStrangeMat0);
    
        // set lights in the array - they do not change, only indices into them do
        pMyInterfaceArray[iLightOffset] = pAmbient0;
        for (uint i = 0; i < 8; i++)
        {
            pMyInterfaceArray[iLightOffset + i + 1] = pDirLight[i];
        }
    
        while (true)
        {
            if (bFlatSunlightOnly)
            {
                // Set g_NumLights to 1 in constant buffer.
                // Set g_LightsInUse[0] to 1 in constant buffer.
                pMyInterfaceArray[iMatOffset] = pFlatMat0;
            }
            else if (bStrangeMaterials)
            {
                // Set g_NumLights and fill out g_LightsInUse.
                pMyInterfaceArray[iMatOffset] = pStrangeMat0;
            }
            else
            {
                // Set g_NumLights and fill out g_LightsInUse.
                pMyInterfaceArray[iMatOffset] = pTexMat0;
            }
    
            // Set the pixel shader and the interfaces to use until the next bind call
            pDevice->PSSetShader(pMyPS, pMyInterfaceArray, NumInterfaces);
    
            // Use the shader that was just bound to draw something
            RenderScene();
        }
    

    7.20 Low Precision Shader Support in D3D


    Section Contents

    (back to chapter)

    7.20.1 Overview

    7.20.1.1 Design Goals / Assumptions
    7.20.2 Precision Levels
    7.20.2.1 10-bit min precision level
    7.20.2.2 16-bit min-precision level
    7.20.2.2.1 float16
    7.20.2.3 int16/uint16
    7.20.3 Low Precision Shader Bytecode
    7.20.3.1 D3D9
    7.20.3.1.1 Token Format
    7.20.3.1.2 Usage Cases
    7.20.3.1.3 Interpreting Minimum Precision
    7.20.3.2 D3D10+
    7.20.3.2.1 Token Format
    7.20.3.3 Usage Cases
    7.20.3.4 Interpreting Precision (same for D3D9 and D3D10+)
    7.20.3.5 Shader Constants
    7.20.3.6 Referencing Shader Constants within Shaders
    7.20.3.7 Component Swizzling
    7.20.3.8 Low Precision Shader Limits
    7.20.4 Feature Exposure
    7.20.4.1 Discoverability
    7.20.4.2 Shader Management
    7.20.4.3 APIs/DDIs
    7.20.4.4 HLSL Exposure


    7.20.1 Overview

    This adds support for 10-bit (2.8 fixed point) and 16-bit precision float arithmetic, and in some cases limited integer arithmetic, to Shader Model 2.0+.

    Shader<->memory I/O operations are unchanged for simplicity, e.g. shader constants continue to be defined as 32-bit per component.

    Implementations are allowed to execute low precision operations at higher precision. So 10-bit arithmetic could be performed at 10-bit or greater (say 32-bit) precision.

    7.20.1.1 Design Goals / Assumptions


    7.20.2 Precision Levels

    The new 10- and 16-bit precision levels for shaders are inspired by their existence in some real hardware and their presence in OpenGL ES. (8-bit was considered but cut because its limitations outweighed the value it seemed to provide at the time.)

                                  | Default Precision        | Min 10-bit fixed | Min 16-bit int / float   | 32-bit int/float | 64-bit float
                                  |                          | point (2.8)      |                          |                  |
    ------------------------------+--------------------------+------------------+--------------------------+------------------+--------------
    Executing at higher           | -                        | Y                | Y                        | N                | N
    precision allowed?            |                          |                  |                          |                  |
    Shader Constants              | -                        | N                | N                        | Y                | Y
    SM 2.x                        | VS: fp32 / int23         | opt              | opt                      | N                | N
                                  | PS: fp24 (s16e7) / int16 |                  |                          |                  |
    SM 3.0                        | fp32                     | N                | N                        | Y                | N
    SM 4.x                        | fp32 / int32             | opt              | opt                      | Y                | opt
    SM 5.0                        | fp32 / int32             | opt              | opt                      | Y                | opt
    Float range                   | -                        | [-2,2)           | [-2^14,2^14]             | Full IEEE 754    | Full IEEE 754
    Float magnitude range         | -                        | 2^-8 ... 2       | On SM 4+,                | Full IEEE 754    | Full IEEE 754
                                  |                          |                  | includes INF/NaN         |                  |
    Int range                     | -                        | -                | (-2^11,2^11);            | full             | -
                                  |                          |                  | full range signed and    |                  |
                                  |                          |                  | unsigned on SM4+         |                  |

    7.20.2.1 10-bit min precision level

    This is a 2.8 fixed point value, though the fixed point semantics may not be identical to the general fixed point semantics defined in the D3D10+ specs. Following the D3D10+ fixed point semantics is recommended for future hardware that may choose to implement the 10-bit precision level.

    8-bit UNORM data is invertible when passed through 10-bit min-precision storage. For example: suppose UNORM 8-bit data that is point sampled from the texture format DXGI_FORMAT_R8G8B8A8_UNORM gets read into a shader and is stored and passed around in the 10-bit representation. If that data is subsequently written unchanged out to a UNORM 8-bit output (such as a DXGI_FORMAT_R8G8B8A8_UNORM rendertarget), the output UNORM value matches the input UNORM value. This guarantee does not (cannot) apply to other formats passing through 10-bit, such as 8-bit UNORM_SRGB or higher precision UNORM values like 16-bit UNORM.
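
    The guarantee can be checked exhaustively. A minimal C sketch (illustrative
    only; it assumes round-to-nearest on each conversion) that round-trips
    every 8-bit UNORM code through the 2.8 fixed point representation:

        #include <assert.h>
        #include <math.h>

        int main(void)
        {
            for (int u = 0; u <= 255; u++)
            {
                float f = u / 255.0f;                   // UNORM8 -> float
                float q = roundf(f * 256.0f) / 256.0f;  // quantize to 2.8 fixed point
                int back = (int)roundf(q * 255.0f);     // 2.8 fixed point -> UNORM8
                assert(back == u);                      // invertible for all 256 codes
            }
            return 0;
        }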

    From the shader's point of view, the 10-bit min-precision level appears as a float value with at minimum [-2,2) range.

    Hardware that supports 10-bit precision must also support 16-bit precision.

    7.20.2.2 16-bit min-precision level

    7.20.2.2.1 float16

    For float values, this is float16 as defined in the D3D10+ specs. The exception is that for Shader Model 2, the max exponent encodings (normally defining NaN/INF) are unused (undefined).

    Conversion from float32 (e.g. from shader constants) to float16 may or may not flush float16 denorms to 0, and round-to-zero is used, per the D3D spec for high-to-low precision float conversion. Float16 arithmetic operations within the shader may or may not flush float16 denorms to 0, and may either round to nearest even or truncate toward zero to a representable number. Out-of-range values in conversion from float32 or in arithmetic may produce +/-MAX_FLOAT16 or +/-INF.
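
    To make this latitude concrete, below is a minimal, non-normative C sketch
    of one permitted float32-to-float16 conversion: round-toward-zero, optional
    denorm flush, and finite overflow clamped to +/-MAX_FLOAT16 (0x7BFF);
    producing +/-INF on overflow instead would be equally legal:

        #include <stdint.h>
        #include <string.h>

        static uint16_t Float32ToFloat16_RTZ(float f, int flushDenorms)
        {
            uint32_t u;  memcpy(&u, &f, sizeof(u));
            uint16_t sign = (uint16_t)((u >> 16) & 0x8000);
            uint32_t man  = u & 0x007FFFFF;
            int32_t  e    = (int32_t)((u >> 23) & 0xFF) - 127 + 15; // rebias to fp16

            if (((u >> 23) & 0xFF) == 0xFF)              // fp32 INF/NaN propagates
                return sign | 0x7C00 | (man ? 0x0200 : 0);
            if (e >= 0x1F)                               // finite overflow
                return sign | 0x7BFF;                    // +/-MAX_FLOAT16
            if (e <= 0)                                  // fp16 denormal range
            {
                if (flushDenorms || e < -10)
                    return sign;                         // flush to +/-0
                // shift in the implicit 1; dropping bits truncates toward zero
                return sign | (uint16_t)((man | 0x00800000) >> (14 - e));
            }
            return sign | (uint16_t)(e << 10) | (uint16_t)(man >> 13); // truncate
        }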

    16-bit integer min-precision is available as well in HLSL. For Shader Model 2, this is constrained to values representable as integral floats (1.0f, 2.0f, etc.) in a float16 encoding. In the shader bytecode these appear simply as float16, so native integer operations are not available. (It may not be worth exposing this constrained form of int16 for SM 2/3.)

    7.20.2.3 int16/uint16

    For Shader Model 4+, native integer ops can be used on 16-bit min-precision values; however, applications must beware that the device could choose to simply use larger-than-16-bit (e.g. 32-bit) integer ops without any clamping to maintain the illusion that no more than 16 bits are present.
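
    A small C illustration of the warning above (the values are hypothetical):
    the two legal implementations disagree whenever a 16-bit integer result
    overflows, so shaders must not depend on 16-bit wraparound of
    min-precision integers.

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint32_t a = 0xFFF0, b = 0x0020;     // sum overflows 16 bits

            uint32_t true16 = (uint16_t)(a + b); // hardware with real 16-bit ops
            uint32_t as32   = a + b;             // hardware running the op at 32 bits

            printf("%#x vs %#x\n", true16, as32); // 0x10 vs 0x10010
            return 0;
        }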

    Shader Constants feeding 16-bit shader arithmetic are always fp32 encoded for Shader Model 2. For Shader Models 4+, Shader Constants feeding 16-bit in the shader are specified as float32 or UINT32/INT32 as appropriate (i.e. unchanged from the way constants feed into float32 arithmetic).

    7.20.3 Low Precision Shader Bytecode

    7.20.3.1 D3D9

    A new MIN_PRECISION enum is added to the source and dest parameter token, definition below. This specifies the minimum precision level for the entire operation – implementations can use equal or greater precision. This new enum co-exists with the PARTIALPRECISION flag that is already in the same dest parameter token – see the comment below.

    7.20.3.1.1 Token Format
    // Source or dest token bits [15:14]:
    #define D3D11_SB_OPERAND_MIN_PRECISION_MASK  0x0001C000
    #define D3D11_SB_OPERAND_MIN_PRECISION_SHIFT 14
    
    typedef enum _D3DSHADER_MIN_PRECISION
    {
        D3DMP_DEFAULT   = 0, // Default precision for the shader model
        D3DMP_16        = 1, // Min 16 bit per component
        D3DMP_2_8       = 2, // Min 10 bits (2.8) per component
    } D3DSHADER_MIN_PRECISION;
    // When MIN_PRECISION is nonzero on a dest token, the dest modifier
    // D3DSPDM_PARTIALPRECISION must also be set for consistency
    //
    // If D3DSPDM_PARTIALPRECISION is set but
    // D3DSHADER_MIN_PRECISION is D3DMP_DEFAULT(0),
    // it is equivalent to D3DSPDM_PARTIALPRECISION + D3DMP_16
    // (PARTIALPRECISION existed before MIN_PRECISION was
    // added, so this defines how the two can coexist without changing
    // meaning for old shaders)
    
    7.20.3.1.2 Usage Cases

    The src/dest token for instructions in PS/VS 2.x can use the MIN_PRECISION enum in the following circumstances:

    7.20.3.1.3 Interpreting Minimum Precision

    7.20.3.2 D3D10+

    A new MIN_PRECISION enum is added to the dest parameter token, definition below. This specifies the minimum precision level for the entire operation – implementations can use equal or greater precision.

    The encoding distinguishes type (e.g. float vs. sint vs. uint), in addition to precision level, to disambiguate instructions like “mov” that don’t already imply a type. This makes a difference when there is a size change involved in the instruction. E.g. moving a 32 bit float to a min. 16 bit float is a different task for hardware than moving a 32 bit uint to a min. 16 bit uint. This type distinction is not needed for the D3D9 shader bytecode because all arithmetic is “float” there.

    7.20.3.2.1 Token Format
    // Min precision specifier for source/dest operands.  This
    // fits in the extended operand token field. Implementations are free to
    // execute at higher precision than the min – details spec’d elsewhere.
    // This is part of the opcode specific control range.
    typedef enum D3D11_SB_OPERAND_MIN_PRECISION
    {
        D3D11_SB_OPERAND_MIN_PRECISION_DEFAULT    = 0, // Default precision
                                                           // for the shader model
        D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_16   = 1, // Min 16 bit/component float
        D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_2_8  = 2, // Min 10(2.8)bit/comp. float
        D3D11_SB_OPERAND_MIN_PRECISION_SINT_16    = 4, // Min 16 bit/comp. signed integer
        D3D11_SB_OPERAND_MIN_PRECISION_UINT_16    = 5, // Min 16 bit/comp. unsigned integer
    } D3D11_SB_OPERAND_MIN_PRECISION;
    #define D3D11_SB_OPERAND_MIN_PRECISION_MASK  0x0001C000
    #define D3D11_SB_OPERAND_MIN_PRECISION_SHIFT 14
    
    // DECODER MACRO: For an OperandToken1 that can specify
    // a minimum precision for execution, find out what it is.
    #define DECODE_D3D11_SB_OPERAND_MIN_PRECISION(OperandToken1) ((D3D11_SB_OPERAND_MIN_PRECISION)(((OperandToken1)& D3D11_SB_OPERAND_MIN_PRECISION_MASK)>> D3D11_SB_OPERAND_MIN_PRECISION_SHIFT))
    
    // ENCODER MACRO: Encode minimum precision for execution
    // into the extended operand token, OperandToken1
    #define ENCODE_D3D11_SB_OPERAND_MIN_PRECISION(MinPrecision) (((MinPrecision)<< D3D11_SB_OPERAND_MIN_PRECISION_SHIFT)& D3D11_SB_OPERAND_MIN_PRECISION_MASK)
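    
    // As a usage illustration (a minimal sketch; the token value below is
    // fabricated, and the macro/enum definitions above are assumed to be
    // in scope), the encoder and decoder macros round-trip a min-precision
    // marking through bits [16:14] of the extended operand token:
    //
    //     UINT tok = 0;
    //     tok |= ENCODE_D3D11_SB_OPERAND_MIN_PRECISION(
    //                D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_16);
    //     assert(DECODE_D3D11_SB_OPERAND_MIN_PRECISION(tok) ==
    //            D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_16);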
    
    // ----------------------------------------------------------------------------
    // Global Flags Declaration
    //
    // OpcodeToken0:
    //
    ... snip ...
    
    // [16:16] Enable minimum-precision data types
    
    ... snip ...
    
    //
    // OpcodeToken0 is followed by no operands.
    //
    // ----------------------------------------------------------------------------
    ... snip ...
    #define D3D11_1_SB_GLOBAL_FLAG_ENABLE_MINIMUM_PRECISION        (1<<16)
    ... snip ...
    
    // DECODER MACRO: Get global flags
    #define DECODE_D3D10_SB_GLOBAL_FLAGS(OpcodeToken0) ((OpcodeToken0)&D3D10_SB_GLOBAL_FLAGS_MASK)
    
    // ENCODER MACRO: Encode global flags
    #define ENCODE_D3D10_SB_GLOBAL_FLAGS(Flags) ((Flags)&D3D10_SB_GLOBAL_FLAGS_MASK)
    
    

    7.20.3.3 Usage Cases

    The dest and source operand tokens in SM 4.0+ can use the MIN_PRECISION enum in the following circumstances:

    7.20.3.4 Interpreting Precision (same for D3D9 and D3D10+)

    7.20.3.5 Shader Constants

    Shader constants are defined at full 32-bit per component. New hardware implementing low precision is encouraged to design efficient downconversion support upon constant access; otherwise the driver will have to insert extra conversion instructions into shaders that read 32-bit per component constants into lower precision shader operations.

    Alternative approaches were considered where low precision constants are exposed all the way to the application (freeing driver/hardware from having to convert constants), but the added complexity in the programming model vs the benefit didn’t hold up at least at this time.

    7.20.3.6 Referencing Shader Constants within Shaders

    When referencing a shader constant from a low precision instruction, if the constant value is out of the range of the instruction’s precision level, the value read is undefined. For constant values within range of a low precision instruction reference, the precision of the value may still get quantized down from full 32 bits.

    Shader constants referenced in shader source operands will be marked at the precision they are to be referenced at, even though they come down the API/DDI at 32-bit per component.

    7.20.3.7 Component Swizzling

    Low precision data is referenced by component in masks and swizzles – xyzw - just like default precision data. It is as though the registers do have a smaller number of bits (for hardware that supports lower precision). This is unlike the way double precision is mapped, where xy contains one double and zw contains another. Low precision doesn’t yield sub-fields within .x for example.

    The HLSL compiler will not generate code that mixes precisions in different components of any xyzw register (mostly for simplicity, even though this may not matter for hardware).

    7.20.3.8 Low Precision Shader Limits

    The use of min / low precision specifiers never increases the maximum amount of resources available to a shader (such as limits on inputs, outputs or temp storage), since the shader must always be able to function on hardware that does not operate at low precision.

    7.20.4 Feature Exposure

    In the D3D system, HLSL shaders are compiled independent of any given device – e.g. they should typically be compiled offline. This compilation step produces device-agnostic bytecode, apart from the choice of shader target, e.g. vs_4_0.

    The minimum precision facility described above can be optionally used within any 4_0+ shader, including 4_0_level_9_1 to 4_0_level_9_3. These shader targets are all available through the D3D11 runtime, exposing D3D9+ hardware via Shader Model 2_x+. The D3D9 runtime will not expose the low precision modes – updating that runtime is out of scope.

    7.20.4.1 Discoverability

    There is a mechanism at the API to discover the precision levels supported by the current device. Note that in Windows 8 the OS did not allow drivers to expose only 10 bit without also exposing 16 bit, but subsequent operating systems relax that requirement (so an implementation may expose 10 bit min precision but not 16 bit min precision).
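
    In the shipping D3D11.1+ runtime this mechanism is
    ID3D11Device::CheckFeatureSupport with
    D3D11_FEATURE_SHADER_MIN_PRECISION_SUPPORT. A minimal sketch (error
    handling reduced; pDevice is assumed to be a valid ID3D11Device):

        #include <d3d11_1.h>

        void QueryMinPrecision(ID3D11Device *pDevice)
        {
            D3D11_FEATURE_DATA_SHADER_MIN_PRECISION_SUPPORT mp = {};
            if (SUCCEEDED(pDevice->CheckFeatureSupport(
                    D3D11_FEATURE_SHADER_MIN_PRECISION_SUPPORT, &mp, sizeof(mp))))
            {
                // Each field is a mask of D3D11_SHADER_MIN_PRECISION_10_BIT
                // and/or D3D11_SHADER_MIN_PRECISION_16_BIT; 0 means "DEFAULT"
                // (min-precision specifiers in shaders are ignored).
                BOOL ps16 = (mp.PixelShaderMinPrecision &
                             D3D11_SHADER_MIN_PRECISION_16_BIT) != 0;
                BOOL rest16 = (mp.AllOtherShaderStagesMinPrecision &
                               D3D11_SHADER_MIN_PRECISION_16_BIT) != 0;
            }
        }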

    Even though the hardware’s precision support is visible to applications, applications do not have to adjust their shaders for the hardware’s precision level given that by definition operations defined with a min precision run at higher precision on hardware that doesn’t support the min precision.

    It is fine for hardware to not support low precision processing at all – by simply reporting “DEFAULT” as its precision support. The reason it is called “DEFAULT” rather than some numerical precision is that, depending on the shader model, there may not be a standard value to express. E.g. the default precision in SM 2.x is fp24 (or greater) within the shader, even though there is no API-visible fp24 format. If the device reports “DEFAULT” precision, all min-precision specifiers in shaders are ignored.

    D3D9 devices are permitted to report a min-precision level that is lower for the Pixel Shader than for the Vertex Shader (all reported via the Windows Next D3D9 DDI). D3D10+ devices can only report a single min-precision level that applies to all shader stages (reported via the Windows Next D3D11.1 DDI) – since it does not seem to make sense to single out the VS any more. Note that if the application uses Feature Level 9_x on D3D10+ hardware, the D3D9 DDIs are still used, so the min-precision levels can be reported differently there between VS and PS, as mentioned for D3D9, even though via the D3D11.1 DDI only a single precision can be reported.

    7.20.4.2 Shader Management

    Regardless of the min precision level supported by a given device, it is always valid to use a shader that was compiled using any combination of the low precision levels on it. For example if a device’s min precision level is 32-bit, it is fine to use a shader compiled with some variables that have a min precision of 10 bit. The device is free to implement the low precision operations at any equal or higher precision level (including precision levels not available at the API).

    For old drivers (pre-D3D11.1 DDI) that are not aware of the low precision feature, the D3D runtime will patch the shader bytecode on shader creation to remove it. This preserves the intent of the shader, since it is valid for the device to execute operations tagged with a min precision level at a higher precision.

    7.20.4.3 APIs/DDIs

    An API is added for reporting device precision support; no other D3D11 API surface area changes apply.

    As for other DDI additions, there are device precision reporting, the shader bytecode additions detailed earlier, and finally a variant of the existing shader stage I/O signature DDI:

    The I/O signature DDI includes MinPrecision in the signature entry. This shows up as D3D11_SB_INSTRUCTION_MIN_PRECISION_DEFAULT if the shader didn’t specify a min-precision:

    typedef struct D3D11_1DDIARG_SIGNATURE_ENTRY
    {
        D3D10_SB_NAME SystemValue; // D3D10_SB_NAME_UNDEFINED if the particular entry doesn't have a system name.
        UINT Register;
        BYTE Mask;// (D3D10_SB_OPERAND_4_COMPONENT_MASK >> 4), meaning 4 LSBs are xyzw respectively
        D3D11_SB_INSTRUCTION_MIN_PRECISION MinPrecision;
    } D3D11_1DDIARG_SIGNATURE_ENTRY;
    
    typedef struct D3D11_1DDIARG_STAGE_IO_SIGNATURES
    {
        D3D11_1DDIARG_SIGNATURE_ENTRY*  pInputSignature;
        UINT                            NumInputSignatureEntries;
        D3D11_1DDIARG_SIGNATURE_ENTRY*  pOutputSignature;
        UINT                            NumOutputSignatureEntries;
    } D3D11_1DDIARG_STAGE_IO_SIGNATURES;
    

    Motivation: Recall that this DDI exists to complement the shader creation DDIs by providing a more complete picture of the shader stage<->stage I/O layout than may be visible just from an individual shader’s bytecode. For example sometimes an upstream stage provides data not consumed by a downstream shader, but it should be possible for a driver to compile a shader on its own without having to wait and see what other shaders it gets used with. MinPrecision is added in case that affects how the driver shader compiler would want to pack the inter-stage I/O data.
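
    For illustration, here is a hypothetical driver-side helper (the function
    name and the packing policy are invented for this sketch) that scans a
    stage's input signature to decide whether an inter-stage register could
    be packed at reduced width:

        // Returns true if every signature entry on 'reg' carries a
        // non-DEFAULT min-precision, i.e. the consuming shader never
        // requires full 32-bit data in that register.
        static bool RegisterIsAllMinPrecision(
            const D3D11_1DDIARG_STAGE_IO_SIGNATURES *pSigs, UINT reg)
        {
            bool found = false;
            for (UINT i = 0; i < pSigs->NumInputSignatureEntries; ++i)
            {
                const D3D11_1DDIARG_SIGNATURE_ENTRY &e = pSigs->pInputSignature[i];
                if (e.Register != reg)
                    continue;
                found = true;
                if (e.MinPrecision == D3D11_SB_INSTRUCTION_MIN_PRECISION_DEFAULT)
                    return false; // some element wants default precision
            }
            return found;
        }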

    7.20.4.4 HLSL Exposure

    Out of scope for this spec.


    8 Input Assembler Stage


    Chapter Contents

    (back to top)

    8.1 IA State
    8.2 Drawing Commands
    8.3 Draw()
    8.4 DrawInstanced()
    8.5 DrawIndexed()
    8.6 DrawIndexedInstanced()
    8.7 DrawInstancedIndirect()
    8.8 DrawIndexedInstancedIndirect()
    8.9 DrawAuto()
    8.10 Primitive Topologies
    8.11 Patch Topologies
    8.12 Generating Multiple Strips
    8.13 Partially Completed Primitives
    8.14 Leading Vertex
    8.15 Adjacency
    8.16 VertexID
    8.17 PrimitiveID
    8.18 InstanceID
    8.19 Misc. IA Issues
    8.20 Input Assembler Data Conversion During Fetching
    8.21 IA Example


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    An overview of the IA is at the beginning(2.1) of the document. This section provides implementation details more as they are viewed from the DDI perspective (exact parameter names may not match). The API view is different, in that instead of hardcoding shader register numbers in the state declaration, names are used, and when creating Input Assembler State objects, the runtime figures out which registers the names correspond to, based on a shader input signature definition.

    An illustrated example of the IA being used is at the end(8.21) of this section.


    8.1 IA State


    Section Contents

    (back to chapter)

    8.1.1 Overview
    8.1.2 Primitive Topology Selection
    8.1.3 Input Layout
    8.1.4 Resource Bindings


    8.1.1 Overview

    The states defining the Input Assembler's operation are described here. Draw*() commands on the Device, described below(8.2), use the currently active IA state to define most of their behavior.

    8.1.2 Primitive Topology Selection

    The following enumeration lists the various Primitive Topologies(8.10) available to the IA.

    
    typedef enum D3D11_PRIMITIVE_TOPOLOGY {
        D3D11_PRIMITIVE_TOPOLOGY_ILLEGAL               = 0, // Cannot use this value.
        D3D11_PRIMITIVE_TOPOLOGY_POINTLIST             = 1,
        D3D11_PRIMITIVE_TOPOLOGY_LINELIST              = 2,
        D3D11_PRIMITIVE_TOPOLOGY_LINESTRIP             = 3,
        D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST          = 4,
        D3D11_PRIMITIVE_TOPOLOGY_TRIANGLESTRIP         = 5,
        // 6 is reserved (legacy triangle fan)
        // 7, 8 and 9 are also reserved
        D3D11_PRIMITIVE_TOPOLOGY_LINELIST_ADJ          = 10,  // start _ADJ at 10,
        D3D11_PRIMITIVE_TOPOLOGY_LINESTRIP_ADJ         = 11,  // so bit 3 can encode adjacency
        D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST_ADJ      = 12,
        D3D11_PRIMITIVE_TOPOLOGY_TRIANGLESTRIP_ADJ     = 13,
        D3D11_PRIMITIVE_TOPOLOGY_1_CONTROL_POINT_PATCHLIST = 17,
        D3D11_PRIMITIVE_TOPOLOGY_2_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_3_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_4_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_5_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_6_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_7_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_8_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_9_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_10_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_11_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_12_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_13_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_14_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_15_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_16_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_17_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_18_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_19_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_20_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_21_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_22_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_23_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_24_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_25_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_26_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_27_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_28_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_29_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_30_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_31_CONTROL_POINT_PATCHLIST,
        D3D11_PRIMITIVE_TOPOLOGY_32_CONTROL_POINT_PATCHLIST
    } D3D11_PRIMITIVE_TOPOLOGY;
    
    

    The current primitive topology for the IA is defined by the following method:

    
    IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY PrimitiveTopology)
    
    

    8.1.3 Input Layout

    The following enumerations are used to build declarations of 1D Buffer structure layout. Structure fields are defined with a format and offset, plus a target register. Multiple elements (from one or more structures) cannot feed a single register. An example declaration follows CreateInputLayout() below.

    
    typedef enum D3D11_INPUT_CLASSIFICATION
    {
        D3D11_INPUT_PER_VERTEX_DATA    = 0,
        D3D11_INPUT_PER_INSTANCE_DATA  = 1
    } D3D11_INPUT_CLASSIFICATION;
    
    
    typedef struct D3D11_INPUT_ELEMENT_DESC
    {
        UINT InputSlot;
        UINT ByteOffset;
        DXGI_FORMAT Format;
        D3D11_INPUT_CLASSIFICATION InputSlotClass; // must be same for all Elements at same InputSlot
        UINT InstanceDataStepRate;   // InstanceDataStepRate is how many
                                     // Instances to draw before stepping one
                                     // unit forward in a VertexBuffer containing
                                     // Instance Data.
                                     // InstanceDataStepRate must be 0 and is
                                     // not used when InputSlotClass == D3D11_INPUT_PER_VERTEX_DATA.
                                     // But when Class == D3D11_INPUT_PER_INSTANCE_DATA,
                                     // InstanceDataStepRate can be any value, including 0.
                                     // 0 takes special meaning, that the instance data
                                     // should never be stepped at all.
                                     // This must be the same for all Elements at same InputSlot
    
        UINT InputRegister;          // Which register in the set of
                                     // inputs to the first active Pipeline
                                     // stage this Element is going to.
    } D3D11_INPUT_ELEMENT_DESC;
    

    The following command creates an input layout.

    
    CreateInputLayout(
        const D3D11_INPUT_ELEMENT_DESC* pDeclaration,
        SIZE_T NumElements,
        ID3D11InputLayout **ppInputLayout);
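
    As an illustration only (the offsets, formats and register assignments are
    invented; note this is the DDI-style declaration above with explicit
    register numbers, not the API's semantic-name form), a layout with a
    per-vertex position and a per-instance color could be declared as:

        D3D11_INPUT_ELEMENT_DESC Elements[] =
        {
            // Slot 0: per-vertex float3 position feeding input register 0.
            { 0, 0, DXGI_FORMAT_R32G32B32_FLOAT, D3D11_INPUT_PER_VERTEX_DATA,   0, 0 },
            // Slot 1: per-instance color feeding input register 1,
            // stepping forward once per instance.
            { 1, 0, DXGI_FORMAT_R8G8B8A8_UNORM,  D3D11_INPUT_PER_INSTANCE_DATA, 1, 1 },
        };

        ID3D11InputLayout *pLayout;
        CreateInputLayout(Elements, 2, &pLayout);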
    
    

    8.1.4 Resource Bindings

    The following methods bind input vertex buffer(s) to the IA. A set of up to 32 Buffers can be bound at once. The layout of vertex or instance data in all of the Buffers is defined by an Input Layout object. There is also a method for binding an Index Buffer to the IA (having a single Element format describing its data layout). A usage sketch follows the method listings below.

    
    
    IASetVertexBuffers( UINT StartSlot, // first Slot for which a Buffer is being bound
                        UINT NumBuffers, // number of slots having Buffers bound
                        ID3D11Buffer *const *pVertexBuffers,
                        const UINT *pStrides,
                        const UINT *pOffsets );
    
    
    IASetInputLayout( ID3D11InputLayout* pInputLayout );
    
    
    IASetIndexBuffer( ID3D11Buffer* pBuffer,
                      DXGI_FORMAT Format,
                      UINT Offset );
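
    For example (a sketch only; pVB, pIB and pLayout are assumed to have been
    created elsewhere, and the stride is hypothetical), binding one vertex
    buffer at slot 0 together with a 16-bit index buffer:

        UINT Stride = 20;   // bytes per vertex in pVB
        UINT Offset = 0;    // start at the beginning of pVB

        IASetInputLayout(pLayout);   // layout from CreateInputLayout() above
        IASetVertexBuffers(0, 1, &pVB, &Stride, &Offset);
        IASetIndexBuffer(pIB, DXGI_FORMAT_R16_UINT, 0);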
    
    

    8.2 Drawing Commands

    The following rendering commands on a device, Draw()(8.3), DrawInstanced()(8.4), DrawIndexed()(8.5), DrawIndexedInstanced()(8.6), DrawInstancedIndirect()(8.7), and DrawIndexedInstancedIndirect()(8.8) introduce primitives into the D3D11.3 Pipeline.

    8.3 Draw()

    
    Draw(   UINT VertexCount,
            UINT StartVertexLocation)
    
    
    UINT VertexCount - How many vertices to read sequentially from the Vertex Buffer(s).
    UINT StartVertexLocation - Which Vertex to start at in each Vertex Buffer.

    8.3.1 Pseudocode for Draw() Vertex Address Calculations and VertexID/PrimitiveID/InstanceID Generation in Hardware

    See the pseudocode for DrawInstanced(), below. Draw() behaves the same as DrawInstanced(), with InstanceCount = 1 and StartInstanceLocation = 0. If "Instance" data has been bound, it will be used. But the intent is for this method to be used without instancing.

    8.4 DrawInstanced()

    
    DrawInstanced(  UINT VertexCountPerInstance,
            UINT InstanceCount,
                    UINT StartVertexLocation,
                    UINT StartInstanceLocation)
    
    
    UINT VertexCountPerInstance - How many vertices to read sequentially from Buffer(s) marked as Vertex Data (same set repeated for each Instance).
    UINT InstanceCount - How many Instances to render.
    UINT StartVertexLocation - Which Vertex to start at in each Buffer marked as Vertex Data (for each Instance).
    UINT StartInstanceLocation - Which Instance to start sequentially fetching from in each Buffer marked as Instance Data.

    8.4.1 Pseudocode for DrawInstanced() Vertex Address Calculations in Hardware

    UINT VertexBufferElementAddressInBytes[32][32]; // [D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT]
                                                    // [D3D11_IA_VERTEX_INPUT_STRUCTURE_ELEMENT_COUNT]
    
    UINT InstanceDataStepCounter[32]; // [D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT]
    
    // Initialize starting Vertex Buffer addresses
    for(each slot, s, with a VertexBuffer assigned)
    {
        if(Slot[s].Class == D3D11_INPUT_PER_VERTEX_DATA)
        {
            for(each Element, e, in the Buffer's Input Layout)
            {
                VertexBufferElementAddressInBytes[s][e] =
                    Slot[s].VertexBufferOffsetInBytes +
                    Slot[s].StrideInBytes*StartVertexLocation +
                    Slot[s].pInputLayout->pElement[e].OffsetInBytes;
            } // Element loop
        }
        else // (Slot[s].Class == D3D11_INPUT_PER_INSTANCE_DATA)
        {
            for(each Element, e, in the Buffer's Input Layout)
            {
                VertexBufferElementAddressInBytes[s][e] =
                    Slot[s].VertexBufferOffsetInBytes +
                    Slot[s].StrideInBytes*StartInstanceLocation +
                    Slot[s].pInputLayout->pElement[e].OffsetInBytes;
            } // Element loop
            InstanceDataStepCounter[s] = Slot[s].InstanceDataStepRate;
        }
    } // slot loop
    
    // Now compute addresses and fetch data
    // for all elements of each buffer for each vertex
    // for each instance.
    
    for(UINT InstanceID = 0;  InstanceID < InstanceCount; InstanceID++)
    {
        for(UINT VertexID = 0;  VertexID < VertexCountPerInstance; VertexID++)
        {
            for(each slot, s, with a VertexBuffer assigned)
            {
                for(each Element, e, in the buffer's Input Layout)
                {
                    // Fetch this vertex Element's data from Slot[s].pBuffer
                    // at address VertexBufferElementAddressInBytes[s][e],
                    // with type Slot[s].pInputLayout->pElement[e].Format,
                    // and output to the Shader Register identified by Slot[s].pInputLayout->pElement[e].Register,
                    // taking account the writemask declared in the shader.
                    FetchDataFromMemory(VertexBufferElementAddressInBytes[s][e],s,e);
    
                    if(Slot[s].Class == D3D11_INPUT_PER_VERTEX_DATA)
                    {
                        // Increment the address for the next access
                        VertexBufferElementAddressInBytes[s][e] +=
                            Slot[s].StrideInBytes;
                    }
                } // Element loop
            } // slot loop
        } // vertex loop
    
        // Patch Instance and Vertex Data addresses at the end of an instance.
        for(each slot, s, with a VertexBuffer assigned)
        {
            if(Slot[s].Class ==  D3D11_INPUT_PER_VERTEX_DATA)
            {
                for(each Element, e, in the buffer's structure declaration)
                {
                    VertexBufferElementAddressInBytes[s][e] =
                        Slot[s].VertexBufferOffsetInBytes +
                        Slot[s].StrideInBytes*StartVertexLocation +
                        Slot[s].pInputLayout->pElement[e].OffsetInBytes;
                } // Element loop
            }
            else //(Slot[s].Class == D3D11_INPUT_PER_INSTANCE_DATA)
            {
                if(1 == InstanceDataStepCounter[s])
                {
                    for(each Element, e, in the buffer's structure declaration)
                    {
                        VertexBufferElementAddressInBytes[s][e] +=
                            Slot[s].StrideInBytes;
                    }
                    InstanceDataStepCounter[s] = Slot[s].InstanceDataStepRate;
                }
                else if(1 < InstanceDataStepCounter[s])
                {
                    InstanceDataStepCounter[s]--;
                }
            }
        } // slot loop
    
        RestartTopology(); // restart at the end of an instance
    } //instance loop
    
    

    8.4.2 Pseudocode for DrawInstanced() VertexID/PrimitiveID/InstanceID Calculations in Hardware

    // The following pseudocode for calculating IDs has been separated out from the
    // address calculation pseudocode above, for clarity. In practice the
    // algorithms would be merged, or possibly be implemented as part of the
    // primitive assembly process.  Note that VertexID/PrimitiveID/InstanceID
    // values are unrelated to address calculations for IA data fetching.
    // If desired, applications can choose ID starting values so that IDs can be used in
    // Shaders to load data from memory out of similar locations in memory as
    // the IA's fixed addressing calculations would have.
    
    UINT VertsPerPrimitive = GetNumVertsBetweenPrimsInCurrentTopology();
        // e.g. VertsPerPrimitive = 3 for tri list
        //                        = 6 for tri list w/adj
        //                        = 1 for tri strip
        //                        = 2 for tri strip w/adj
        //                        = 2 for line list
        //                        = 4 for line list w/adj
        //                        = 1 for line strip
        //                        = 1 for line strip w/adj
        //                        = 1 for point list
    
    UINT VertsPerCompletedPrimitive =
                GetNumVertsUntilFirstCompletedPrimitiveInCurrentTopology();
        // e.g. VertsPerCompletedPrimitive = 3 for tri list
        //                                 = 6 for tri list w/adj
    //                                 = 3 for tri strip
        //                                 = 7 for tri strip w/adj, (not 6) since 1
        //                                        vert is not involved in the prim,
        //                                        when the strip has more than one
        //                                        primitive.
        //                                 = 2 for line list
        //                                 = 4 for line list w/adj
        //                                 = 2 for line strip
        //                                 = 4 for line strip w/adj
        //                                 = 1 for point list
    
    for(UINT InstanceID = 0; InstanceID < InstanceCount; InstanceID++)
    {
        UINT PrimitiveID = 0;
        UINT VertsUntilNextCompletePrimitive = VertsPerCompletedPrimitive;
    
        SetNextInstanceID(InstanceID); // subsequent vertices and primitives
                                       // will get this InstanceID
    
        for(UINT VertexID = 0; VertexID < VertexCountPerInstance; VertexID++)
        {
            VertsUntilNextCompletePrimitive--;
            if( VertsUntilNextCompletePrimitive == 0 )
            {
                SetNextPrimitiveID(PrimitiveID++);
                VertsUntilNextCompletePrimitive = VertsPerPrimitive;
            }
            SetNextVertexID(VertexID);
        } // vertex loop
    
        if( IsTriangleStripWithAdjacency() && (VertsUntilNextCompletePrimitive == 1) )
        {
            // When traversing a triangle strip w/ adjacency, after the initial 7
            // vertices, every other vertex completes a primitive, EXCEPT when
            // the end of the strip is reached, where the last 2 consecutive
            // vertices each complete a primitive.
            SetNextPrimitiveID(PrimitiveID++); // in a tristrip w/adj
                                               // the last completed primitive has
                                               // not been counted yet.
        }
    } // instance loop
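
    To make the counter behavior above concrete, the following is a hand-computed trace (illustrative, not normative) for a plain triangle strip with VertexCountPerInstance = 5. Per the values above, VertsPerCompletedPrimitive = 3 and VertsPerPrimitive = 1 for a tri strip.

    // VertexID | VertsUntilNextCompletePrimitive (after decrement) | PrimitiveID emitted
    //    0     |  2                                                |  -
    //    1     |  1                                                |  -
    //    2     |  0 -> reset to 1                                  |  0
    //    3     |  0 -> reset to 1                                  |  1
    //    4     |  0 -> reset to 1                                  |  2
    // A 5-vertex strip thus yields 3 triangles, with PrimitiveIDs 0, 1 and 2.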
    

    8.5 DrawIndexed()

    
    DrawIndexed(    UINT IndexCount,
                    UINT StartIndexLocation,
                    INT  BaseVertexLocation)
    
    
    UINT IndexCount How many indices to read sequentially from the Index Buffer.
    UINT StartIndexLocation Which Index to start at in the Index Buffer.
    INT BaseVertexLocation Which Vertex in each buffer marked as Vertex Data to consider as Index "0". Note that this value is signed. A negative BaseVertexLocation allows, for example, the first vertex to be referenced by an index value > 0.

    8.5.1 Pseudocode for DrawIndexed() Vertex Address and VertexID/PrimitiveID/InstanceID Calculations in Hardware

    See the pseudocode for DrawIndexedInstanced(), below. DrawIndexed() behaves the same as DrawIndexedInstanced(), with InstanceCount = 1 and StartInstanceLocation = 0. If "Instance" data has been bound, it will be used. But the intent is for this method to be used without instancing.
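
    For example, the following two calls (sketched against the D3D11 API; pContext is assumed to be a valid ID3D11DeviceContext pointer) request identical behavior:

    // Illustrative sketch: a non-instanced indexed draw and its instanced equivalent.
    pContext->DrawIndexed(36, 0, 0);
    pContext->DrawIndexedInstanced(
        36,   // IndexCountPerInstance
        1,    // InstanceCount = 1
        0,    // StartIndexLocation
        0,    // BaseVertexLocation
        0);   // StartInstanceLocation = 0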

    8.6 DrawIndexedInstanced()

    
    DrawIndexedInstanced(   UINT IndexCountPerInstance,
                            UINT InstanceCount,
                            UINT StartIndexLocation,
                            INT  BaseVertexLocation,
                            UINT StartInstanceLocation)
    
    
    UINT IndexCountPerInstance How many indices to read sequentially from the Index Buffer (same set repeated for each Instance).
    UINT InstanceCount How many Instances to render.
    UINT StartIndexLocation Which Index to start at in the Index Buffer (for each Instance).
    INT BaseVertexLocation Which Vertex in each buffer marked as Vertex Data to consider as Index "0". Note that this value is signed. A negative BaseVertexLocation allows, for example, the first vertex to be referenced by an index value > 0.
    UINT StartInstanceLocation Which Instance to start sequentially fetching from in each Buffer marked as Instance Data.

    8.6.1 Pseudocode for DrawIndexedInstanced() Vertex Address Calculations in Hardware

    
    UINT VertexBufferElementAddressInBytes[32][32]; // [D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT]
                                                    // [D3D11_IA_VERTEX_INPUT_STRUCTURE_ELEMENT_COUNT]
    UINT InstanceDataStepCounter[32]; // [D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT]
    
    // Initialize starting Index Buffer address
    UINT IndexBufferElementAddressInBytes = StartIndexLocation*sizeof(IndexBuffer.Format) + IndexBufferOffsetInBytes;
    
    // Initialize starting Vertex Buffer addresses
    // (relevant to Instance Data only, as this is traversed without indexing.)
    for(each slot, s, with a VertexBuffer assigned)
    {
        if(Slot[s].Class == D3D11_INPUT_PER_INSTANCE_DATA)
        {
            for(each Element, e, in the Buffer's structure declaration)
            {
                VertexBufferElementAddressInBytes[s][e] =
                    Slot[s].VertexBufferOffsetInBytes +
                    Slot[s].StrideInBytes*StartInstanceLocation +
                    Slot[s].pInputLayout->pElement[e].OffsetInBytes;
            } // Element loop
            InstanceDataStepCounter[s] = Slot[s].InstanceDataStepRate;
        }
    } // slot loop
    
    // Now compute addresses and fetch data
    // for all elements of each buffer for each vertex
    // for each instance.
    
    for(UINT InstanceID = 0; InstanceID < InstanceCount; InstanceID++)
    {
        for(UINT i = 0; i < IndexCountPerInstance; i++)
        {
            UINT IndexValue = FetchIndexFromIndexBuffer(IndexBufferElementAddressInBytes, IndexBuffer.Format);
    
            if(GetPredefinedCutIndexValue(IndexBuffer.Format) == IndexValue)
            {
                RestartTopology();
    
                // Increment the index address
                IndexBufferElementAddressInBytes += sizeof(IndexBuffer.Format);
    
                // No vertex to fetch for this iteration...
                continue;
            }
    
            for(each slot, s, with a VertexBuffer assigned)
            {
                UINT IndexedOffset;
                if(Slot[s].Class == D3D11_INPUT_PER_VERTEX_DATA)
                {
                    IndexedOffset = Slot[s].StrideInBytes*( BaseVertexLocation + IndexValue);
                }
                for(each Element, e, in the buffer's structure declaration)
                {
                    if(Slot[s].Class == D3D11_INPUT_PER_VERTEX_DATA)
                    {
                        VertexBufferElementAddressInBytes[s][e] =
                            Slot[s].VertexBufferOffsetInBytes +
                            IndexedOffset +
                            Slot[s].pInputLayout->pElement[e].OffsetInBytes;
                    }
    
                    // Fetch this vertex Element's data from Slot[s].pBuffer
                    // at address VertexBufferElementAddressInBytes[s][e],
                    // with type Slot[s].pInputLayout->pElement[e].Format,
                    // and output to the Shader Register identified by Slot[s].pInputLayout->pElement[e].Register,
                    // taking account the writemask declared in the shader.
                    FetchDataFromMemory(VertexBufferElementAddressInBytes[s][e],s,e);
    
                } // Element loop
            } // slot loop
            // Increment the index address
            IndexBufferElementAddressInBytes += sizeof(IndexBuffer.Format);
        } // index loop
    
    
        // Patch Instance Data addresses at the end of an instance.
        for(each slot, s, with a VertexBuffer assigned)
        {
            if(Slot[s].Class == D3D11_INPUT_PER_INSTANCE_DATA)
            {
                if(1 == InstanceDataStepCounter[s])
                {
                    for(each Element, e, in the buffer's structure declaration)
                    {
                        VertexBufferElementAddressInBytes[s][e] +=
                            Slot[s].StrideInBytes;
                    }
                    InstanceDataStepCounter[s] = Slot[s].InstanceDataStepRate;
                }
                else if(1 < InstanceDataStepCounter[s])
                {
                    InstanceDataStepCounter[s]--;
                }
            }
        } // slot loop
    
        RestartTopology();  // restart at the end of an instance
    } //instance loop
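
    As a worked illustration of the per-vertex address arithmetic above, suppose (hypothetically) slot 0 holds per-vertex data with StrideInBytes = 32 and VertexBufferOffsetInBytes = 16, element e has OffsetInBytes = 12, BaseVertexLocation = 2, and the fetched IndexValue is 5:

    // Hand-computed example of the address math (all 32-bit unsigned arithmetic):
    UINT IndexedOffset = 32 * (2 + 5);            // StrideInBytes*(BaseVertexLocation + IndexValue) = 224
    UINT Address       = 16 + IndexedOffset + 12; // + VertexBufferOffsetInBytes + element OffsetInBytes
                                                  // = 252, the byte address of element e for this vertex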
    

    8.6.2 Pseudocode for DrawIndexedInstanced() VertexID/PrimitiveID/InstanceID Calculations in Hardware

    // The following pseudocode for calculating IDs has been separated out from the
    // address calculation pseudocode above, for clarity. In practice the
    // algorithms would be merged, or possibly be implemented as part of the
    // primitive assembly process.  Note that VertexID/PrimitiveID/InstanceID
    // values are unrelated to address calculations for IA data fetching.
    // If desired, applications can choose ID starting values so that the IDs can be
    // used in Shaders to load data from locations in memory similar to those the
    // IA's fixed addressing calculations would have fetched from.
    
    UINT VertsPerPrimitive = GetNumVertsBetweenPrimsInCurrentTopology();
        // e.g. VertsPerPrimitive = 3 for tri list
        //                        = 6 for tri list w/adj
        //                        = 1 for tri strip
        //                        = 2 for tri strip w/adj
        //                        = 2 for line list
        //                        = 4 for line list w/adj
        //                        = 1 for line strip
        //                        = 1 for line strip w/adj
        //                        = 1 for point list
    
    UINT VertsPerCompletedPrimitive =
                GetNumVertsUntilFirstCompletedPrimitiveInCurrentTopology();
        // e.g. VertsPerCompletedPrimitive = 3 for tri list
        //                                 = 6 for tri list w/adj
        //                                 = 3 for tri strip
        //                                 = 7 for tri strip w/adj, (not 6) since 1
        //                                        vert is not involved in the prim,
        //                                        when the strip has more than one
        //                                        primitive.
        //                                 = 2 for line list
        //                                 = 4 for line list w/adj
        //                                 = 2 for line strip
        //                                 = 4 for line strip w/adj
        //                                 = 1 for point list
    
    UINT CutIndexValue = GetPredefinedCutIndexValue(IndexBuffer.Format);
    
    for(UINT InstanceID = 0; InstanceID < InstanceCount; InstanceID++)
    {
        UINT PrimitiveID = 0;
        UINT VertsUntilNextCompletePrimitive = VertsPerCompletedPrimitive;
    
        SetNextInstanceID(InstanceID); // subsequent vertices and primitives
                                       // will get this InstanceID
        for(UINT i = 0; i < IndexCountPerInstance; i++)
        {
            UINT IndexValue = FetchIndexFromIndexBuffer(); // Detail hidden: see the full index
            // fetch calculation in the DrawIndexedInstanced() address pseudocode above
            // (which in practice this code would be merged with).
    
            if(CutIndexValue == IndexValue)
            {
                if( IsTriangleStripWithAdjacency() && (VertsUntilNextCompletePrimitive == 1) )
                {
                    // When traversing a triangle strip w/ adjacency, after the initial 7
                    // vertices, every other vertex completes a primitive, EXCEPT when
                    // the end of the strip is reached, where the last 2 consecutive
                    // vertices each complete a primitive.
                    SetNextPrimitiveID(PrimitiveID++); // in a tristrip w/adj
                                                       // the last completed primitive has
                                                       // not been counted yet.
                }
                VertsUntilNextCompletePrimitive = VertsPerCompletedPrimitive;
            }
            else
            {
                VertsUntilNextCompletePrimitive--;
                if( VertsUntilNextCompletePrimitive == 0 )
                {
                    SetNextPrimitiveID(PrimitiveID++);
                    VertsUntilNextCompletePrimitive = VertsPerPrimitive;
                }
                SetNextVertexID(IndexValue);
            }
        } // vertex loop
    
        if( IsTriangleStripWithAdjacency() && (VertsUntilNextCompletePrimitive == 1) )
        {
            // When traversing a triangle strip w/ adjacency, after the initial 7
            // vertices, every other vertex completes a primitive, EXCEPT when
            // the end of the strip is reached, where the last 2 consecutive
            // vertices each complete a primitive.
            SetNextPrimitiveID(PrimitiveID++);  // in a tristrip w/adj
                                                // the last completed primitive has
                                                // not been counted yet.
        }
    } // instance loop
    

    8.7 DrawInstancedIndirect()

    DrawInstancedIndirect(
        ID3D11Buffer *pBufferForArgs,
        UINT AlignedByteOffsetForArgs);
    
    struct DrawInstancedIndirectArgs
    {
        UINT VertexCountPerInstance;
        UINT InstanceCount;
        UINT StartVertexLocation;
        UINT StartInstanceLocation;
    };
    
    ID3D11Buffer *pBufferForArgs A buffer that contains an array of DrawInstancedIndirectArgs, described in the struct above.
    UINT AlignedByteOffsetForArgs A DWORD-aligned byte offset for the data.
    UINT VertexCountPerInstance How many vertices to read sequentially from Buffer(s) marked as Vertex Data (same set repeated for each Instance).
    UINT InstanceCount How many Instances to render.
    UINT StartVertexLocation Which Vertex to start at in each Buffer marked as Vertex Data (for each Instance).
    UINT StartInstanceLocation Which Instance to start sequentially fetching from in each Buffer marked as Instance Data.

    If the address range in the Buffer where DrawInstancedIndirect’s parameters will be fetched from would go out of bounds of the Buffer, behavior is undefined.

    Here(18.6.5.1) is a discussion about ways to initialize the arguments for DrawInstancedIndirect.
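
    For illustration only (not a normative part of this spec), creating an argument Buffer through the D3D11 API and issuing the indirect draw might look like the following; pDevice and pContext are assumed valid, and error handling is omitted:

    // One DrawInstancedIndirectArgs record, written at buffer creation time.
    UINT args[4] = { 36, 100, 0, 0 }; // VertexCountPerInstance, InstanceCount,
                                      // StartVertexLocation, StartInstanceLocation
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof(args);
    desc.Usage     = D3D11_USAGE_DEFAULT;
    desc.MiscFlags = D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS;
    D3D11_SUBRESOURCE_DATA init = { args, 0, 0 };
    ID3D11Buffer* pArgBuffer = nullptr;
    pDevice->CreateBuffer(&desc, &init, &pArgBuffer);
    pContext->DrawInstancedIndirect(pArgBuffer, 0); // arguments start at byte offset 0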

    8.8 DrawIndexedInstancedIndirect()

    DrawIndexedInstancedIndirect(
            ID3D11Buffer *pBufferForArgs,
            UINT AlignedByteOffsetForArgs);
    
    struct DrawIndexedInstancedIndirectArgs
    {
        UINT IndexCountPerInstance;
        UINT InstanceCount;
        UINT StartIndexLocation;
        INT  BaseVertexLocation;
        UINT StartInstanceLocation;
    };
    
    ID3D11Buffer *pBufferForArgs A buffer that contains an array of DrawIndexedInstancedIndirectArgs, described in the struct above.
    UINT AlignedByteOffsetForArgs A DWORD-aligned byte offset for the data.
    UINT IndexCountPerInstance How many indices to read sequentially from the Index Buffer (same set repeated for each Instance).
    UINT InstanceCount How many Instances to render.
    UINT StartIndexLocation Which Index to start at in the Index Buffer (for each Instance).
    INT BaseVertexLocation Which Vertex in each buffer marked as Vertex Data to consider as Index "0". Note that this value is signed. A negative BaseVertexLocation allows, for example, the first vertex to be referenced by an index value > 0.
    UINT StartInstanceLocation Which Instance to start sequentially fetching from in each Buffer marked as Instance Data.

    If the address range in the Buffer where DrawIndexedInstancedIndirect’s parameters will be fetched from would go out of bounds of the Buffer, behavior is undefined.

    Here(18.6.5.1) is a discussion about ways to initialize the arguments for DrawIndexedInstancedIndirect.

    8.9 DrawAuto()

    DrawAuto is used with StreamOutput(14) in order to use a Stream Output Buffer as an Input Assembler Vertex Input Buffer without requiring the BufferFilledSize to get back to the CPU. The Buffer bound to slot zero must have both the Stream Output and Input Assembler Vertex Input Bind Flags set. When invoked, DrawAuto will draw from the Buffer offset associated with slot zero to the BufferFilledSize(14.4) associated with the Buffer. If the BufferFilledSize is less than or equal to the specified buffer offset, then nothing is drawn. The primitive type for DrawAuto is the current primitive topology set via IASetPrimitiveTopology(8.1.2), regardless of the geometry shader output topology used while the buffer is filled.

    Buffers may be bound to other IA input slots above zero for DrawAuto (only the IA bind flag is required on these slots), and these can be part of the Vertex Declaration as well. Reading out of bounds on any Buffer above slot zero in DrawAuto invokes the default behavior for reading out of bounds (as with any other Draw* call).

    DrawAuto()
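
    A typical two-pass usage pattern, sketched against the D3D11 API (illustrative only; pSOBuffer is assumed to be a Buffer created with both D3D11_BIND_STREAM_OUTPUT and D3D11_BIND_VERTEX_BUFFER, and shader/resource setup is omitted):

    UINT soOffset = 0;
    pContext->SOSetTargets(1, &pSOBuffer, &soOffset);
    pContext->Draw(vertexCount, 0);               // pass 1: fill pSOBuffer via Stream Output
    ID3D11Buffer* pNull = nullptr;
    pContext->SOSetTargets(1, &pNull, &soOffset); // unbind from Stream Output
    UINT stride = vertexStride, offset = 0;
    pContext->IASetVertexBuffers(0, 1, &pSOBuffer, &stride, &offset); // bind at IA slot zero
    pContext->DrawAuto();                         // pass 2: draw up to BufferFilledSize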
    

    8.10 Primitive Topologies

    The diagram below defines the vertex ordering for all of the primitive topologies that the IA can produce. The enumeration of primitive topologies is here(8.1.2).

    As an example, suppose the IA is asked to draw triangle lists with adjacency, and it is invoked with a vertex count of 36 by a Draw() call. From the diagram it should be apparent that a 36-vertex triangle list with adjacency will result in 6 completed primitives.

    An interesting property of all the topologies with adjacency (except line strips) is that they contain exactly double the number of vertices as the equivalent topology without adjacency. Every other vertex represents an "adjacent" vertex.

    8.11 Patch Topologies

    Not shown in the previous diagram (but part of the same list) are 32 additional topologies which represent 1...32 control point patches, respectively. These Patch topologies can be used with Tessellation(11). Also, when Tessellation is disabled(11.8) (meaning no Hull Shader and no Domain Shader bound), they can be fed to the Geometry Shader and/or Stream Output, allowing patch data to be saved to memory, and allowing non-traditional primitive types to be fed to the GS (such as simulating cubes using 8 control point patches to represent 8 vertices).

    8.12 Generating Multiple Strips

    In Indexed rendering of strip topologies, the maximum representable index value in the index format (i.e. 0xffffffff for 32-bit indices) means the strip defined up to the previous index is to be completed, and the next index is a new strip. This special "cut" value is not required to be used, in which case a DrawIndexed*() command will simply draw one strip. In IndexedInstanced rendering, there is an automatic "cut" after every instance. Regardless of Instanced rendering or not, it is optional whether to make the last index the cut value, or omit the value; both result in the same behavior, except that the IndexCount[PerInstance] parameter to DrawIndexed[Instanced]() is different by 1.
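
    For example, with the 16-bit index format (cut value 0xffff), the following (hypothetical) index stream produces two triangle strips from a single DrawIndexed() call:

    // Two triangle strips separated by the cut value, 16-bit index format:
    UINT16 Indices[] = { 0, 1, 2, 3,      // strip 1: 2 triangles
                         0xffff,          // cut: complete strip 1, begin strip 2
                         4, 5, 6, 7, 8 }; // strip 2: 3 triangles
    // The IndexCount passed to DrawIndexed() would be 10; the cut index is counted.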

    Even if the current Primitive Topology is not a strip, the cut index value still takes effect, potentially resulting in an incomplete primitive (see next section). Thus, handling of the cut is kept orthogonal to primitive topology, even though it is not useful for some of them.

    Note that providing a behavior for the cut value when used with a non-strip topology is a way of saying that the behavior is defined, allowing hardware to keep the cut behavior always enabled. In practice though, using cut for a list topology is obviously not a "feature" that it would ever make sense for an application to author to.

    8.13 Partially Completed Primitives

    Each Draw*() call starts a new Primitive Topology; there is no persistence of any topology produced by a previous Draw() call. Triangle strips don't continue across Draw() call boundaries.

    If a Draw*() call produces incomplete primitives (not enough vertices), either at the end of the Draw*() call, or anywhere in the middle (possible with the "cut" index), any incomplete primitives are silently discarded. For example, suppose a Draw*() call is made with triangle list as the topology, and a vertex count of 5. This case would result in a single triangle, with the last 2 vertices silently discarded. For another example showing handling of an incomplete primitive, see the diagram under the Geometry Shader Stage here(13.10), depicting which primitives are instantiated given a triangle strip with adjacency that has a dangling vertex.

    8.14 Leading Vertex

    For the purpose of assigning constant vertex attributes to primitives, there must be a way to map a vertex to a primitive. Let us identify the vertex in a primitive which supplies its per-primitive constant data as the "leading vertex". A primitive topology can have multiple leading vertices, one for each primitive in the topology. The leading vertex for an individual primitive in a topology is the first non-adjacent vertex in the primitive. For the triangle strip with adjacency above, the leading vertices are 0, 2, 4, 6, etc. For the line strip with adjacency, the leading vertices are 1, 2, 3 etc.

    Note that adjacent primitives have no leading vertex. This means that there is no primitive data associated with adjacent primitives. With the strip topologies, a given interior primitive has some adjacent primitives which are also interior to the topology, and so actually can have primitive data. However, as far as the Geometry Shader is concerned (it sees a single primitive and its neighboring primitives in an invocation), only the single interior primitive defining the Geometry Shader invocation can have Primitive Data; adjacent primitives, whether or not they are interior to the strip, never come with Primitive Data.

    8.15 Adjacency

    The only place in the Pipeline where adjacency information is visible to the application is in the Geometry Shader. Each invocation of the Geometry Shader sees a single primitive: a point, line, or triangle, and some of these might include adjacent vertices.

    The inputs to the Geometry Shader are like a single primitive of any of the "list" primitive topologies (with or without adjacency) in the diagram above. When adjacency is available, the Geometry Shader will see the appropriate adjacent vertices along with the primitive's vertices. So for example if the Geometry Shader is invoked with a triangle including adjacency (the source could have been a strip with adjacency), this would mean that data for 6 vertices would be available as input in the Geometry Shader: 3 vertices for the triangle, and 3 for the adjacency.

    The data layout for adjacent vertices is identical to that of the standard vertices they accompany. Note that Vertex Shaders are always run on all vertices, including adjacent vertices. Adjacent vertices are typically also surface vertices of some other primitive that will get drawn, so the Vertex Shader result cache can take advantage of this.

    When the IA is instructed to produce a primitive topology with adjacency for its output, all adjacent vertices must be specified. There is no concept of handling edges with no adjacent primitive. The application must deal with this on their own, perhaps by providing a dummy vertex (possibly forming a degenerate triangle), or perhaps by flagging in one of the vertex attributes whether the vertex "exists" or not. The application's Geometry Shader code will have to detect this situation, if desired, and deal with it manually. Implied in this is that there must be no culling of degenerate primitives until rasterizer setup, so that the Geometry Shader is guaranteed to see all geometry.

    Note that when Tessellation is enabled, topologies with adjacency cannot be used. The Tessellator operates a patch at a time without hardware knowledge about adjacency (although shader code is free to encode it on its own). The Tessellator's outputs are independent primitives, with no adjacency information.

    8.16 VertexID

    VertexID is a 32-bit unsigned integer scalar counter value coming out of Draw*() calls identifying to Shaders each vertex. This value can be declared(22.3.11) for input by the Vertex Shader only.

    For Draw() and DrawInstanced(), VertexID starts at 0, and it increments for every vertex. Across instances in DrawInstanced(), the count resets back to the start value. Should the 32-bit VertexID calculation overflow, it simply wraps.

    For DrawIndexed() and DrawIndexedInstanced(), VertexID represents the index value.

    The mere presence of VertexID in a Vertex Shader's input declarations activates the feature (there is no other control outside the shader). If the application wishes to pass this data to later Pipeline stages, the application can do so by simply writing the value to a Shader output register like any other data.

    For Primitive Topologies with adjacency, such as a triangle strip w/adjacency, the "adjacent" vertices have a VertexID associated with them just like the "non-adjacent" vertices do, all generated uniformly (i.e. without regard to which vertices are adjacent and which are not in the topology).

    For more information, see the general discussion of System Generated Values here(4.4.4), the reference for VertexID here(23.1), and the System Interpreted/Generated Value input(22.3.11) declaration for Shaders.

    8.17 PrimitiveID

    PrimitiveID is a 32-bit unsigned integer scalar counter value coming out of Draw*() calls identifying to Shaders each primitive. This value can be declared(22.3.11) for input by either the Hull Shader, Domain Shader, Geometry Shader or Pixel Shader Stage. Regarding the GS and PS: if the GS is active, the hardware-generated PrimitiveID goes to the GS, and any PrimitiveID the PS receives must be computed by the GS and passed down as shader output.

    PrimitiveID starts at 0 for the first primitive generated by a Draw*() call, and increments for each subsequent primitive. When Draw*Instanced() is used, the PrimitiveID resets to its starting value whenever a new instance begins in the set of instances produced by the call. Should the 32-bit PrimitiveID calculation overflow, it simply wraps.

    The mere presence of PrimitiveID in a compatible Shader Stage's input declarations activates the feature (there is no other control outside the shader). In the Geometry Shader this is declared as the special register vPrim (to decouple the value from the other per-vertex inputs). If the application wishes to pass PrimitiveID to a later Pipeline stage, the application can do so by simply writing the value to a Shader output register like any other data. The Pixel Shader does not have a separate input for PrimitiveID; it just goes into a component of a normal input register, with the requirement that the interpolation mode on the entire input register (which may contain other data as well in other components) is chosen as "constant".

    For Primitive Topologies(8.10) with adjacency, such as a triangle strip w/adjacency, the PrimitiveID is only maintained for the interior primitives in the topology (the non-adjacent primitives), just like the set of primitives in a triangle strip without adjacency. No point in the Pipeline has a way of asking for an auto-generated PrimitiveID for adjacent primitives.

    For more information, see the general discussion of System Generated Values here(4.4.4), the reference for PrimitiveID here(23.2), and the System Interpreted/Generated Value input(22.3.11) and output(22.3.33) declarations for Shaders.

    8.18 InstanceID

    InstanceID is a 32-bit unsigned integer scalar counter value coming out of Draw*() calls identifying to Shaders which instance is being drawn. This value can be declared(22.3.11) for input by the Vertex Shader only.

    InstanceID starts at 0 for the first instance of vertices generated by a Draw*() call. If the Draw is a Draw*Instanced() call, after each instance of vertices, the InstanceID increments by one. If the Draw is not a Draw*Instanced() call, then InstanceID never changes. Should the 32-bit InstanceID calculation overflow, it simply wraps.

    The mere presence of InstanceID in the Vertex Shader's input declarations activates the feature (there is no other control outside the shader). If the application wishes to pass this data to later Pipeline stages, the application can do so by simply writing the value to a Shader output register like any other data.

    For more information, see the general discussion of System Generated Values here(4.4.4), the reference for InstanceID here(23.3), and the System Interpreted/Generated Value input(22.3.11) declaration for Shaders.


    8.19 Misc. IA Issues


    Section Contents

    (back to chapter)

    8.19.1 Input Assembler Arithmetic Precision
    8.19.2 Addressing Bounds
    8.19.3 Buffer and Structure Offsets and Strides
    8.19.4 Reusing Input Resources
    8.19.5 Fetching Data in the IA vs. Fetching Later (i.e. Multiple Ways to Do the Same Thing)


    8.19.1 Input Assembler Arithmetic Precision

    The Input Assembler performs 32-bit unsigned integer arithmetic, conforming to the IA addressing pseudocode shown in this spec. In other words, should any calculation overflow 32-bits, it would wrap - and should that result happen to fall back into a valid range for the scenario, so be it. Wherever input parameters are listed as signed integers (such as BaseVertexLocation in DrawIndexed()(8.5)) they are interpreted, unaltered, as unsigned 32-bit numbers, used in unsigned 32-bit addressing arithmetic, producing unsigned 32-bit results.
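
    For example (hand-computed), a BaseVertexLocation of -2 enters the math as the unsigned value 0xfffffffe, and the 32-bit wrap brings the result back into range:

    UINT Base  = 0xfffffffe;          // bit pattern of INT BaseVertexLocation = -2
    UINT Index = 5;                   // IndexValue fetched from the Index Buffer
    UINT Off   = 32 * (Base + Index); // Base + Index wraps to 3, so Off = 96 (stride 32)
    // i.e. index 5 addresses vertex 3, exactly as a signed interpretation would expect.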

    8.19.2 Addressing Bounds

    An individual Draw*() call is limited to producing a finite number of vertices. This limit includes any instancing that is occurring within the Draw*() call. Independent of such a limit, there are also limits on how big various source data buffers can be. All of these (large) numbers can be found within the table(21) in the System Limits on Various Resources section. These numbers are expected to be reasonable for the foreseeable lifetime of D3D11.3.

    Any calculated address that would fall out of bounds for a Buffer being accessed results in out-of-bounds behavior being invoked, where the return is 0 in all non-missing components of the format (defined in the Input Layout), and the default for missing components (see Defaults for Missing Components(19.1.3.3)). This out-of-bounds behavior applies, for example, when an index refers to a vertex number that is outside of the bound vertex buffer.

    The minimum extent for the bounds is any initial offset applied on the Buffer (so "negative" indexing isn't a feature).

    8.19.3 Buffer and Structure Offsets and Strides

    See the Element Alignment(4.4.6) section.

    8.19.4 Reusing Input Resources

    It is perfectly legal to read any given memory Buffer in multiple places in the Pipeline, including the IA, simultaneously, even applying different interpretations to the data in the Buffer. A single Buffer can even be set as input at multiple slots at a single stage such as the IA.

    For example, suppose an application has a Vertex Shader that requires 2 different sets of input texture coordinates. One scenario could be to use 2 different input Buffers to provide the different texture coordinates to be fetched by the IA (or both texture coordinates could be interleaved in one Buffer). But an alternate, equally valid scenario is to reuse the same source data to supply both texture coordinates to what the Vertex Shader expects as two different sets. This is simply a matter of binding the same input Buffer to two different input slots.

    Another way to achieve the same effect of reusing a single set of data is to bind the source texture coordinate Buffer to a single slot and provide a data declaration where the definition of 2 different texture coordinates overlaps (same structure offset). Partial-overlapping of types in a data declaration is even permitted (even though it isn't interesting); the point is that D3D11.3 doesn't care or bother to check.
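
    As a (hypothetical) illustration, an Input Layout in which two texture coordinate Elements alias the same bytes in a single slot could be declared as follows:

    // Two TEXCOORD Elements sharing structure offset 12 in slot 0 (illustrative):
    D3D11_INPUT_ELEMENT_DESC layout[] =
    {
        { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
        { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT,    0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
        { "TEXCOORD", 1, DXGI_FORMAT_R32G32_FLOAT,    0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    };
    // TEXCOORD0 and TEXCOORD1 read the same data; D3D11.3 does not object.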

    Similarly, the structure stride in a vertex declaration can be any non-negative value (up to a maximum of 2048 Bytes, and conforming to alignment(4.4.6) rules), without regard to whether it is large enough to support the fields defined for the structure. Again, the point is that D3D11.3 doesn't care or bother to check. Debug tools can be provided to optionally enforce well-ordered, logical data layouts; however, the arithmetic that underlying hardware uses to actually address data simply blindly follows the intent shown by the pseudocode for address calculations for the Draw*()(8.2) routines.

    It is legal to have a single Buffer containing both vertex data and index data, and thus bind the Buffer at both a Vertex Buffer input slot and as an Index Buffer simultaneously. One might store indices at the beginning of the Buffer and the vertex data being referred to elsewhere in the same Buffer. D3D11.3 doesn't care.

    As yet another, final (contrived) example, to drive the point home: Suppose a Vertex Buffer is set as input to the IA to provide data for vertices going to the Vertex Shader (as usual). Simultaneously, the same Vertex Buffer may be accessed directly by the Vertex Shader, if for some reason the Shader occasionally wanted to look at some of the input data for vertices other than itself.

    8.19.5 Fetching Data in the IA vs. Fetching Later (i.e. Multiple Ways to Do the Same Thing)

    The highly flexible and programmable nature of the D3D11.3 Pipeline leads to many situations where there are multiple ways to accomplish a single task. A particular example relevant to this section is that the fetching of vertex data performed by the IA can be performed identically by doing memory fetches from the Vertex Shader, given only a VertexID as input. This has nice properties, such as the fact that even though the amount of data the IA can pre-fetch for a single vertex is limited in size, memory fetches from shaders allow much larger amounts of vertex data to be fetched if necessary. Memory fetches from shaders can also use much more complex addressing arithmetic than the common-case dedicated fixed-function arithmetic used by the IA.

    No guarantees or requirements are made by D3D11.3, however, as to the performance characteristics of using alternative mechanisms to perform a task that can be performed by an explicit feature intended for that task in the Pipeline. As a general rule, whenever there is an explicit mechanism to perform a task in D3D11.3, IHVs and ISVs should assume that as much as possible, the dedicated functionality is the preferred route, at least when all of or most of the other parts of the graphics Pipeline are simultaneously active.

    8.20 Input Assembler Data Conversion During Fetching

    When the Input Assembler reads Elements of data from Buffers, the data is converted to the appropriate 32-bit data type for the Format(19.1) interpretation specified. The conversion uses the Data Conversion(3.2) rules. If the source data contains 32-bit per-component float, UINT or SINT data, it is read without modifying the bits at all (no conversion).
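
    For example (hand-computed, per the Data Conversion rules), one 8-bit component of a DXGI_FORMAT_R8G8B8A8_UNORM element expands to the 32-bit float delivered in the v# register as:

    BYTE  raw       = 128;          // stored byte
    float converted = raw / 255.0f; // ~0.502f is what the shader input register sees
    // 0 maps to 0.0f and 255 maps exactly to 1.0f.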

    If a Vertex Buffer or Index Buffer is read by the Input Assembler, but the slot being read has no Buffer bound, the result of the read is 0 for all expected components. Even though there is format information available via the input layout, defaults are not applied to missing channels for this case.

    8.21 IA Example

    The following example shows DrawIndexedInstanced()(8.6) being used to draw 3 instances of an indexed mesh.

    The example does not attempt to draw anything particularly interesting, but it does show most of the functionality of the IA being used at once, in complete detail. Included is a depiction of the resulting workload for the rest of the Graphics Pipeline.

    As input, one Vertex Buffer supplies Vertex Data, another Vertex Buffer supplies Instance Data, and there is an Index Buffer. The data layouts and configuration of all of these buffers are illustrated. VertexID(8.16), PrimitiveID(8.17) and InstanceID(8.18) are all shown as well, assuming Shaders in the pipeline requested them. The Primitive Topology(8.10) being rendered is triangle strip with adjacency. The Index Buffer has a Cut(8.12) in it, so multiple strips are produced (per instance).

    Various states shown in boxes represent the API settings for Buffers and for the IA states described earlier in this IA spec.




    9 Vertex Shader Stage


    Chapter Contents

    (back to top)

    9.1 Vertex Shader Instruction Set
    9.2 Vertex Shader Invocation
    9.3 Vertex Shader Inputs
    9.4 Vertex Shader Output
    9.5 Registers


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    9.1 Vertex Shader Instruction Set

    The Vertex Shader instruction set is listed here(22.1.3).

    9.2 Vertex Shader Invocation

    For every vertex generated by the IA, the Vertex Shader is invoked, provided that there is a miss in the hardware's Vertex Shader result cache. Adjacent vertices are treated equivalently to interior vertices in a topology, so the Vertex Shader is executed for all vertices.

    9.3 Vertex Shader Inputs

    The primary inputs to a Vertex Shader invocation are 32 32-bit*4-component registers (v#) comprising the elements of the input vertex (not all have to be used). ConstantBuffers (cb#) and textures (t#) provide random access input to Vertex Shaders.

    9.4 Vertex Shader Output

    The output of a Vertex Shader is up to 32 32-bit*4-component registers (o#). The o# registers to be written by the Shader must be declared (e.g. "dcl_output o[3].xyz").

    9.5 Registers

    The following registers are available in the vs_5_0 model:

    Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL
    32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | n | none | y
    32-bit Indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | y | none | y
    32-bit Input (v#) | 32 | r | 4 | y | none | y
    Element in an input resource (t#) | 128 | r | 1 | n | none | y
    Sampler (s#) | 16 | r | 1 | n | none | y
    ConstantBuffer reference (cb#[index]) | 15 | r | 4 | y (contents) | none | y
    Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | y (contents) | none | y
    Output Registers:
    NULL (discard result, useful for ops with multiple results) | n/a | w | n/a | n/a | n/a | n
    32-bit output Vertex Data Element (o#) | 32 | w | 4 | n/a | n/a | y
    Unordered Access View (u#) | 64 | r/w | 1 | n | n | y

    10 Hull Shader Stage


    Chapter Contents

    (back to top)

    10.1 Hull Shader Instruction Set
    10.2 Hull Shader Invocation
    10.3 HS State Declarations
    10.4 HS Control Point Phase
    10.5 HS Patch Constant Phases
    10.6 Hull Shader Structure Summary
    10.7 Hull Shader Control Point Phase Contents
    10.8 Hull Shader Fork Phase Contents
    10.9 Hull Shader Join Phase Contents
    10.10 Hull Shader Tessellation Factor Output
    10.11 Restrictions on Patch Constant Data
    10.12 Shader IL "Ret" Instruction Behavior in Hull Shader
    10.13 Hull Shader MaxTessFactor Declaration


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    For a Tessellation overview, see the Tessellator(11) section.

    10.1 Hull Shader Instruction Set

    The Hull Shader instruction set is listed here(22.1.4).

    10.2 Hull Shader Invocation

    The Hull Shader operates once per patch, transforming Control Points, computing Patch Constant data and defining Tessellation Factors.

    The Hull Shader has four phases, all defined together as one program. That is, from the API/DDI point of view, the Hull Shader is a single atomic shader, and its phases are an implementation detail within the Hull Shader program. Implementations can choose to exploit independent work within a Patch by executing work within a single patch in parallel.

    The phases appear in the Intermediate Language as standalone shaders, each with individual input and output declarations tailored to what each independent program is doing. However the inputs and outputs across all of the shaders come out of a fixed pool of Hull Shader-wide input data and output storage, described later in great detail.

    The Hull Shader phase structure is depicted in the following picture:

    10.3 HS State Declarations

    This section of the Hull Shader has no executable code. It simply declares some overall characteristics of Hull Shader operation, such as how many control points the HS inputs and outputs (an independent number). The operation of the fixed function Tessellator is also defined here – such as choosing the patch domain, partitioning etc. A tessellation pattern overview is given here(11.7).

    Note that declarations that are typical in shaders, such as input and output register declarations and declarations of input Resources, Constant Buffers, Samplers etc. are part of each individual shader phase below, not part of this HS State declaration section.

    See Tessellator State(11.7.15).

    10.4 HS Control Point Phase

    In the Hull Shader’s Control Point phase, a thread is invoked once per patch output control point. An input value vOutputControlPointID(23.7) identifies to each thread which output control point it represents. Each of the threads sees shared input of all the input control points for the patch. The output of each thread is one of the output control points for the patch.


    10.5 HS Patch Constant Phases


    Section Contents

    (back to chapter)

    10.5.1 Overview
    10.5.2 HS Patch Constant Fork Phase
    10.5.3 HS Patch Constant Join Phase


    10.5.1 Overview

    The Patch Constant phases compute constant data such as Tessellation Factors(10.10) (how much the fixed function Tessellator should tessellate), as well as any other Patch Constant data, beyond the patch Control Points, that the application may need in the Domain Shader(12) (the shader that runs once per Tessellator output point).

    The Patch Constant phases occur after the Control Point phase is complete, and have read-only access to all of the input and output Control Points. So for example, Control Points could be examined to help calculate Tessellation Factors(10.10) for each patch edge.

    There are two Patch Constant phases:

    10.5.2 HS Patch Constant Fork Phase

    The Patch Constant Fork Phase is a collection of an arbitrary number of independent programs. For the discussion in this section let us call these independent programs mini-shaders.

    Each mini-shader produces independent (non-overlapping) parts of the total output Patch Constant data (such as all the different TessFactors(10.10)).

    An implementation could choose to execute each mini-shader in parallel, since they are independent. Or, in the opposite extreme an implementation could choose to trivially concatenate all the mini-shaders together and run them serially. Such transformations of the mini-shaders are trivial to perform (in a driver’s compiler) given they all share the same inputs and perform non-overlapping writes to a unified output space.

    An implementation could even choose to hoist any amount of the code from the Fork Phase up into the Control Point Phase if that happened to be most efficient. This is allowable because all the parts of a Hull Shader are specified together as if it is one program – how its contents are executed does not matter as long as the output is deterministic.

    The shared inputs to each mini-shader are all of the Control Point Phase’s Input and Output Control Points.

    The output of each mini-shader is a non-overlapping subset of the output Patch Constant Data.

    There is no communication of data between mini-shaders, other than the fact that they share Control Point input.

    To further enable parallelism within a single mini-shader, any mini-shader can be declared to run in an instanced fashion, given a fixed instance count per patch. During execution, each instance of an instanced mini-shader is identified by a ForkInstanceID(23.8) and is responsible for producing a unique output, typically by indexing an array of outputs. So for example, a single mini-shader instanced 4 times could output edge TessFactors for each edge of a quad patch.

    10.5.3 HS Patch Constant Join Phase

    The final Hull Shader phase is the Patch Constant Join Phase. This phase behaves the same way as the Fork Phase, in that there can be multiple Join programs that are independent of each other. All of them execute after all the Fork Phase programs. An example use for this phase is to derive TessFactors(10.10) for the inside of a patch given the edge TessFactors computed in the previous phase.

    The inputs to each Patch Constant Join Phase shader are all of the Control Point Phase’s Input and Output Control Points, as well as all of the Patch Constant Fork Phase’s output.

    The output of each Patch Constant Join Phase shader is a subset of the output Patch Constant data that does not overlap any of the outputs of the shaders from the Patch Constant Fork Phase or other Join Phase shaders.

    Similar to the fork phase, to enable parallelism within a join phase mini-shader, any mini-shader can be declared to run in an instanced fashion, given a fixed instance count per patch. During execution, each instance of an instanced mini-shader is identified by a JoinInstanceID(23.9) and is responsible for producing a unique output, typically by indexing an array of outputs. So for example, a single mini-shader instanced 2 times could output inside TessFactors for each inside direction of a quad patch.

    10.6 Hull Shader Structure Summary

    The various phases of the Hull Shader are described in the Intermediate Language as separate shader models. A single Hull Shader program consists of a collection of the following shaders appearing in the order listed here:

    hs_decls(22.3.14): Hull Shader State Declarations

    hs_control_point_phase(22.3.21): Hull Shader Control Point Phase

    hs_fork_phase(22.3.23): Hull Shader Patch Constant Fork Phase

    hs_join_phase(22.3.26): Hull Shader Patch Constant Join Phase

    From the point of view of the HLSL code author and API user, the name for the Hull Shader compiler target is simply hs_5_0.

    10.7 Hull Shader Control Point Phase Contents

    hs_control_point_phase(22.3.21) is a shader program with the following register model. Note the footnotes which provide a detailed discussion of output storage size calculations.

    Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL
    32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | N | None | Y
    32-bit Indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | Y | None | Y
    32-bit Input (v[vertex][element]) | 32(element)*32(vert) | r | 4 | Y | None | Y
    32-bit UINT Input vOutputControlPointID(23.7) | 1 | r | 1 | N | None | Y
    32-bit UINT Input PrimitiveID (vPrim) | 1 | r | 1 | N | n/a | Y
    Element in an input resource (t#) | 128 | r | 128 | Y | None | Y
    Sampler (s#) | 16 | r | 1 | Y | None | Y
    ConstantBuffer reference (cb#[index]) | 15 | r | 4 | Y | None | Y
    Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | Y (contents) | None | Y
    Output Registers:
    32-bit output Vertex Data Element (o#) | 32, see (1) below | w | 4 | Y | None | Y

    (1) Each Hull Shader Control Point Phase output register is up to a 4-vector, of which up to 32 registers can be declared. There are also from 1 to 32 output control points declared, which scales the amount of storage required. Let us refer to the maximum allowable aggregate number of scalars across all Hull Shader Control Point Phase output as #cp_output_max.

    #cp_output_max = 3968 scalars

    This limit happens to be based on a design point of 4096 scalars of 32-bit storage in certain hardware. The amount for Control Point output is 3968 = 4096-128, which is 32(control points)*4(component)*32(elements) - 4(component)*32(elements). The subtraction reserves 128 scalars (one control point) worth of space dedicated to HS Phases 2 and 3, discussed below. The choice of reserving 128 scalars for Patch Constants (as opposed to allowing the amount to be simply whatever of the 4096 scalars of storage is unused by output Control Points) accommodates the limits of another particular hardware design. Note the Control Point Phase can declare 32 output control points; they just can’t be fully 32 elements with 4 components each, since the total storage would be too high.

    10.7.1 System Generated Values input to the HS Control Point Phase

    InstanceID(8.18) and VertexID(8.16) can be input as long as the previous Vertex Shader stage outputs them.

    PrimitiveID(8.17) is also available as a scalar 32-bit integer input for each Control Point. PrimitiveID indicates the current patch in the Draw*() call, starting with 0. This PrimitiveID is the same value that the Geometry Shader would see for every patch if it input PrimitiveID - that is, every point/line/triangle produced by the tessellator for a given patch carries the single PrimitiveID of the entire Patch.

    OutputControlPointID(23.7) is a scalar 32-bit integer input for each Control Point identifying which one it is [0..n-1] given n declared output Control Points.


    10.8 Hull Shader Fork Phase Contents


    Section Contents

    (back to chapter)

    10.8.1 HS Fork Phase Programs
    10.8.2 HS Fork Phase Registers
    10.8.3 HS Fork Phase Declarations
    10.8.4 Instancing of an HS Fork Phase Program
    10.8.5 System Generated Values in the HS Fork Phase


    10.8.1 HS Fork Phase Programs

    There can be 0 or more Fork Phase programs present in a Hull Shader. Each of them declares its own inputs, but they come from the same pool of input data – the Control Points. Each Fork Phase program declares its own outputs as well, but out of the same output register space as all Fork Phase and Join Phase programs, and the outputs can never overlap.

    10.8.2 HS Fork Phase Registers

    The following registers are visible in the hs_fork_phase(22.3.23) model.

    The input resources (t#), samplers (s#), constant buffers (cb#) and immediate constant buffer (icb) below are all shared state with all other HS Phases. That is, from the API/DDI point of view, the Hull Shader has a single set of input resource state for all phases. This goes with the fact that from the API/DDI point of view, the Hull Shader is a single atomic shader; the phases within it are implementation details.

    Note the footnotes which provide a detailed discussion of output storage size calculations.

    Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL
    32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | N | None | Y
    32-bit Indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | Y | None | Y
    32-bit Input Control Points (vicp[vertex][element]) (pre-Control Point Phase) | 32, see (1) below | r | 4(component)*32(element)*32(vert) | Y | None | Y
    32-bit Output Control Points (vocp[vertex][element]) (post-Control Point Phase) | 32, see (1) below | r | 4(component)*32(element)*32(vert) | Y | None | Y
    32-bit UINT Input PrimitiveID (vPrim) | 1 | r | 1 | N | n/a | Y
    32-bit UINT Input ForkInstanceID(23.8) (vForkInstanceID) | 1 | r | 1 | N | n/a | Y
    Element in an input resource (t#) | 128 | r | 128 | Y | None | Y
    Sampler (s#) | 16 | r | 1 | Y | None | Y
    ConstantBuffer reference (cb#[index]) | 15 | r | 4 | Y | None | Y
    Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | Y (contents) | None | Y
    Output Registers:
    32-bit output Patch Constant Data Element (o#) | 32, see (2) below | w | 4 | Y | None | Y

    (1) The HS Fork Phase’s Input Control Point register (vicp) declarations must be any subset, along the [element] axis, of the HS Control Point input (pre-Control Point phase). Similarly the declarations for inputting the Output Control Points (vocp) must be any subset, along the [element] axis, of the HS Output Control Points (post-Control Point Phase).

    Along the [vertex] axis, the number of control points to be read for each of the vicp and vocp must similarly be a subset of the HS Input Control Point count and HS Output Control Point count, respectively. For example, if the vertex axis of the vocp registers are declared with n vertices, that makes the Control Point Phase’s Output Control Points [0..n-1] available as read only input to the Fork Phase.

    (2) The HS Fork and Join phase outputs are a shared set of 32 4-vector registers (128 scalars total). The outputs of each Fork/Join phase program cannot overlap with each other. System Interpreted Values such as TessFactors(10.10) come out of this space.

    10.8.3 HS Fork Phase Declarations

    The declarations for inputs, outputs, temp registers, resources, etc. in an HS Fork Phase program are like those of any standalone shader. A given HS Fork Phase program need only declare what it needs to read and write. Further, if it does not need to see all Input or Output Control Points, it can declare a subset of the counts for each, by declaring a smaller number on the [vertex] array axis than the corresponding number of Control Points actually available.

    There is not a way to declare that a sparse set of the Control Points is read. E.g. a shader that needs to read Input Control Points [0], [3], [11] and [15] would just declare the Input Control Point (vicp) register’s [vertex] axis size as 16. Note that if references to the Control Points from shader code use static indexing, it will be obvious to drivers exactly what subset of Control Points is actually needed by the program anyway.

    10.8.4 Instancing of an HS Fork Phase Program

    Any individual HS Fork Phase program can be declared to execute instanced, with a declaration identifying a fixed instance count from 1 to 128 (128 is the maximum number of scalar Patch Constant outputs). The HS Fork Phase program executes the declared number of times per patch, with each instance identified by its 32-bit UINT input register vForkInstanceID(23.8).

    Note that if the role of an instanced Fork Phase program is for each instance to produce a System Interpreted Value(4.4.5), say one of the edge TessFactors(10.10) for a quad patch per instance, the declarations for each of those outputs would identify the System Interpreted Value being produced, just like any other shader.

    10.8.5 System Generated Values in the HS Fork Phase

    The HS Fork Phase can input PrimitiveID(8.17) in its own register just like the HS Control Point Phase. The value in this register is the same as what the HS Control Point Phase sees. The other special input register in the HS Fork Phase is vForkInstanceID(23.8), described previously.

    The system doesn’t go out of its way to automatically provide other System Generated Values(4.4.4) (VertexID(8.16), InstanceID(8.18)) to the HS Fork Phase. Values like these are part of the Input Control Points (if they were declared to be there) already, so the HS Fork phase can read VertexID/InstanceID by reading them out of the Input Control Points.

    The treatment of InstanceID(8.18) does seem strange, in that InstanceID would be the same for all Control Points in a Patch (indeed, unchanging across multiple patches), yet it shows up per-Input Control Point. However, this is consistent with the behavior elsewhere in the pipeline, where the first active stage that can input a System Generated Value (for InstanceID, that is the Vertex Shader) is responsible for passing the value down to the next stage via shader output (rather than the hardware feeding the value down to subsequent stages separately). For the Geometry Shader to see InstanceID, it also shows up in each input vertex there, just like it shows up in each Input Control Point in the Hull Shader.


    10.9 Hull Shader Join Phase Contents


    Section Contents

    (back to chapter)

    10.9.1 HS Join Phase Program
    10.9.2 HS Join Phase Registers
    10.9.3 HS Join Phase Declarations
    10.9.4 Instancing of an HS Join Phase Program
    10.9.5 System Generated Values in the HS Join Phase


    10.9.1 HS Join Phase Program

    There can be 0 or more Join Phase programs present in a Hull Shader. Each of them declares its own inputs, but they come from the same pool of input data – the Control Points as well as the Patch Constant outputs of the Fork Phase programs. Each Join Phase program declares its own outputs as well, but out of the same output register space as all Fork Phase and Join Phase programs, and the outputs can never overlap.

    10.9.2 HS Join Phase Registers

    The following registers are visible in the hs_join_phase(22.3.26) model. Note there are three sets of input registers: vicp (Control Point Phase Input Control Points), vocp (Control Point Phase Output Control Points), and vpc (Patch Constants). vpc are the aggregate output of all the HS Fork Phase program(s). The HS Join Phase output o# registers are in the same register space as the HS Fork Phase outputs.

    The input resources (t#), samplers (s#), constant buffers (cb#) and immediate constant buffer (icb) below are all shared state with all other HS Phases. That is, from the API/DDI point of view, the Hull Shader has a single set of input resource state for all phases. This goes with the fact that from the API/DDI point of view, the Hull Shader is a single atomic shader; the phases within it are implementation details.

    Note the footnotes which provide a detailed discussion of output storage size calculations.

    Register Type | Count | r/w | Dimension | Indexable by r# | Defaults | Requires DCL
    32-bit Temp (r#) | 4096 (r# + x#[n]) | r/w | 4 | N | None | Y
    32-bit Indexable Temp Array (x#[n]) | 4096 (r# + x#[n]) | r/w | 4 | Y | None | Y
    32-bit Input Control Points (vicp[vertex][element]) (pre-Control Point Phase) | 32, see (1) below | r | 4(component)*32(element)*32(vert) | Y | None | Y
    32-bit Output Control Points (vocp[vertex][element]) (post-Control Point Phase) | 32, see (1) below | r | 4(component)*32(element)*32(vert) | Y | None | Y
    32-bit Input (vpc[element]) (Patch Constant Data) | 32, see (3) below | r | 4 | Y | None | Y
    32-bit UINT Input PrimitiveID (vPrim) | 1 | r | 1 | N | n/a | Y
    32-bit UINT Input JoinInstanceID(23.9) (vJoinInstanceID) | 1 | r | 1 | N | n/a | Y
    Element in an input resource (t#) | 128 | r | 128 | Y | None | Y
    Sampler (s#) | 16 | r | 1 | Y | None | Y
    ConstantBuffer reference (cb#[index]) | 15 | r | 4 | Y | None | Y
    Immediate ConstantBuffer reference (icb[index]) | 1 | r | 4 | Y (contents) | None | Y
    Output Registers:
    32-bit output Patch Constant Data Element (o#) | 32, see (2) below | w | 4 | Y | None | Y

    (1) The HS Join Phase’s Input Control Point register (vicp) declarations must be a subset, along the [element] axis, of the HS Control Point input (pre-Control Point Phase). Similarly, the declarations for inputting the Output Control Points (vocp) must be a subset, along the [element] axis, of the HS Output Control Points (post-Control Point Phase).

    Along the [vertex] axis, the number of control points to be read for each of vicp and vocp must similarly be a subset of the HS Input Control Point count and HS Output Control Point count, respectively. For example, if the vertex axis of the vocp registers is declared with n vertices, that makes the Control Point Phase’s Output Control Points [0..n-1] available as read-only input to the Join Phase.

    (2) The HS Fork and Join phase outputs are a shared set of 32 4-vector registers. The outputs of each Fork/Join phase program cannot overlap with each other. System Interpreted Values such as TessFactors(10.10) come out of this space.

    (3) In addition to Control Point input, the HS Join Phase also sees as input the Patch Constant data computed by the HS Fork Phase program(s). This shows up at the HS Join Phase as the vpc# registers. The HS Join Phase’s input vpc# registers share the same register space as the HS Fork Phase output o# registers. The HS Join Phase’s o# register declarations must not overlap with any HS Fork Phase program’s o# output declarations; the HS Join Phase is adding to the aggregate Patch Constant data output for the Hull Shader.

    10.9.3 HS Join Phase Declarations

    The declarations for inputs, outputs, temp registers, resources etc. in an HS Join Phase program function the same way as HS Fork Phase declarations(10.8.3).

    10.9.4 Instancing of an HS Join Phase Program

    Any individual HS Join Phase program can be declared to execute instanced, with a declaration identifying a fixed instance count from 1 to 128 (128 is the maximum number of scalar Patch Constant outputs). The HS Join Phase program executes the declared number of times per patch, with each instance identified by its 32-bit UINT input register vJoinInstanceID(23.9).

    Note that if the role of an instanced Join Phase program is for each instance to produce a System Interpreted Value(4.4.5), say one of the inside TessFactors(10.10) for a quad patch per instance, the declarations for each of those outputs would identify the System Interpreted Value being produced, just like any other shader.

    10.9.5 System Generated Values in the HS Join Phase

    System Generated Values are dealt with the same(10.8.5) way in the HS Join Phase as in the HS Fork Phase. Instead of vForkInstanceID(23.8), in the Join Phase the same thing is called vJoinInstanceID(23.9). PrimitiveID(8.17) is available as a standalone input register.


    10.10 Hull Shader Tessellation Factor Output


    Section Contents

    (back to chapter)

    10.10.1 Overview
    10.10.2 Tri Patch TessFactors
    10.10.3 Quad Patch TessFactors
    10.10.4 Isoline TessFactors


    10.10.1 Overview

    Hull Shader(10) Fork and Join Phase code can declare up to 6 of their output scalars as System Interpreted Values that identify various Tessellation Factors, driving how much tessellation the fixed function Tessellator should perform. For example, on a Quad there are 4 TessFactors for the edges, as well as 2 for the inside. HLSL exposes alternative (helper) ways to generate the inside TessFactors automatically from the edge TessFactors, e.g. deriving them by min/max/avg on the edge values, and possibly scaling based on user-provided scale values. The hardware does not understand anything about this helper processing (it just appears as shader code).

    The optional (from the HLSL author point of view) tessellation factor processing results in HLSL compiler autogenerated shader code in either or both of the Fork and Join Phases. This standard processing can involve cleaning up of values, handling of special low TessFactor cases to prevent popping, and rounding of the values depending on the tessellation mode.

    The final Tessellation Factors after this processing go to the fixed function Tessellator hardware – TessFactors for each edge and explicit TessFactors for the patch inside (as opposed to TessFactorScale the user specifies).

    Downstream, Domain Shader(12) code may be interested in seeing all of the intermediate values generated during any optional TessFactor processing. For example, to be able to perform blending during Pow2 Partitioning tessellation, one might want to see the ratio between unrounded and rounded TessFactor values. To enable that, the auto-generated code in the Fork and/or Join Phases will output not only final TessFactor values for the tessellator, but also the intermediate values, so the Domain Shader can access them. There are at most 12 such additional values (in the case of a Quad Patch). Again, the hardware does not understand anything about these "helper" values, and they are not discussed in detail here.

    The next sections describe just the TessFactors relevant to the hardware without discussing the various optional helper routines that HLSL provides to derive them.

    Further information about how Tessellation Factors are interpreted is here(11.7.10).

    10.10.2 Tri Patch TessFactors

    float3 SV_TessFactor(24.8)

    The first component provides the TessFactor for the U==0 edge of the patch.

    The second component provides the TessFactor for the V==0 edge of the patch.

    The third component provides the TessFactor for the W==0 edge of the patch.

    The above hardware/system interpreted values must be declared in the same component of 3 consecutive registers (since indexing is on that axis).

    float SV_InsideTessFactor(24.9)

    This determines how much to tessellate the inside of the tri patch.
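
    For illustration only, the HLSL-level declaration of these values in a patch constant output struct could look like the following sketch (struct and field names are hypothetical; only the SV_ semantics are prescribed by this spec):

        struct HSTriConstants
        {
            float EdgeTess[3] : SV_TessFactor;       // U==0, V==0, W==0 edges, in that order
            float InsideTess  : SV_InsideTessFactor; // interior of the tri patch
        };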

    10.10.3 Quad Patch TessFactors

    float4 SV_TessFactor(24.8)

    The first component provides the TessFactor for the U==0 edge of the patch.

    The second component provides the TessFactor for the V==0 edge of the patch.

    The third component provides the TessFactor for the U==1 edge of the patch.

    The fourth component provides the TessFactor for the V==1 edge of the patch.

    The ordering of the edges is clockwise, starting from the U==0 edge (visualized as the "left" edge of the patch).

    The above hardware/system interpreted values must be declared in the same component of 4 consecutive registers (since indexing is on that axis).

    float2 SV_InsideTessFactor(24.9)

    The first component determines how much to tessellate along the U direction of the inside of the patch.

    The second component determines how much to tessellate along the V direction of the inside of the patch.
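
    For illustration only, a hypothetical HLSL patch constant output struct for the quad domain (only the SV_ semantics are prescribed):

        struct HSQuadConstants
        {
            float EdgeTess[4]   : SV_TessFactor;       // U==0, V==0, U==1, V==1 edges (clockwise)
            float InsideTess[2] : SV_InsideTessFactor; // [0] = U direction, [1] = V direction
        };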

    10.10.4 Isoline TessFactors

    float2 SV_TessFactor(24.8)

    The first component determines the line density (how many tessellated parallel lines to generate in the V direction over the patch area).

    The second component determines the line detail (how finely tessellated each of the parallel lines is, in the U direction over the patch area).

    The above hardware/system interpreted values must be declared in the same component of 2 consecutive registers (since indexing is on that axis).

    IsoLines are discussed further here(11.6).
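
    For illustration only, a hypothetical HLSL patch constant output struct for the isoline domain:

        struct HSIsoLineConstants
        {
            float DensityDetail[2] : SV_TessFactor; // [0] = line density, [1] = line detail
        };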

    10.11 Restrictions on Patch Constant Data

    The Hull Shader output Patch Constant data appears as 32 vec4 elements. The placement of the Final TessFactors is constrained as described in the previous sections – each grouping of TessFactors must appear in a specific order in the same component of consecutive registers/elements in the Patch Constant Data. E.g. for Quad Patches, the four Final Edge TessFactors in a fixed order make up one grouping, and the two Final Inside TessFactors in a fixed order make up another separate grouping.

    Shader indexing of the Patch Constant data across the 32 vec4 elements is restricted, due to the limitations of a particular hardware implementation, as follows:

    10.12 Shader IL "Ret" Instruction Behavior in Hull Shader

    Since the Hull Shader has multiple phases, each of which can be instanced (e.g. multiple Control Points in the Control Point phase, or instanced Fork or Join Phases), the "ret*" (return(22.7.16) or conditional return(22.7.17)) shader instruction is defined to end only the current instance of the current phase. So a "ret*" in the Control Point Phase would only finish the current Control Point invocation without affecting the others or other phases. Or a "ret*" in a Fork or Join Phase program would only end that instance of that program without affecting other instances (if it is instanced) or other Fork/Join programs.

    10.13 Hull Shader MaxTessFactor Declaration

    The HS State Declaration Phase can optionally include a fixed float32 MaxTessFactor(22.3.20) in the range {1.0...64.0}.

    This MaxTessFactor declaration(22.3.20) is useful when the application knows the maximum amount of tessellation it could possibly request through the TessFactor values output from the Hull Shader. Communicating this knowledge to the device allows it to optionally take advantage of it to perform better scheduling of resources on the GPU.

    If a MaxTessFactor is declared, it is enforced by HLSL autogenerated TessFactor clamping code as the last step in the calculation of all of the following hardware System Interpreted Values (whose meanings were described earlier):

    SV_TessFactor

    SV_InsideTessFactor

    For simplicity only a single MaxTessFactor value can be declared, and when it is present, it is applied to all the TessFactors listed above.
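
    As a hedged sketch of how this surfaces to the shader author, the HLSL [maxtessfactor()] attribute on a hull shader is shown below in a hypothetical minimal pass-through hull shader (everything other than the attribute names is illustrative):

        struct CP { float3 pos : POSITION; };
        struct QuadTF
        {
            float Edge[4]   : SV_TessFactor;
            float Inside[2] : SV_InsideTessFactor;
        };

        QuadTF QuadConstants(InputPatch<CP, 4> p)
        {
            QuadTF tf;
            tf.Edge[0] = tf.Edge[1] = tf.Edge[2] = tf.Edge[3] = 8.0;
            tf.Inside[0] = tf.Inside[1] = 8.0;
            return tf;
        }

        [domain("quad")]
        [partitioning("fractional_even")]
        [outputtopology("triangle_cw")]
        [outputcontrolpoints(4)]
        [patchconstantfunc("QuadConstants")]
        [maxtessfactor(8.0)] // compiler emits the MaxTessFactor declaration plus clamp code
        CP MainHS(InputPatch<CP, 4> p, uint i : SV_OutputControlPointID)
        {
            return p[i];
        }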

    The device sees the MaxTessFactor declaration as a part of the Hull Shader. The knowledge of this declaration is what hardware can optionally take advantage of to optimize Tessellation performance for content going through that Hull Shader, versus an otherwise identical Hull Shader without the declaration.

    If HLSL fails to enforce the MaxTessFactor when it is declared (by clamping the HS output TessFactors), and a TessFactor larger than MaxTessFactor arrives at the Tessellator, the Tessellator’s behavior is undefined. Hitting this undefined situation is a Microsoft HLSL compiler (or driver compiler) bug, not the fault of the shader author or hardware.

    Note that independent of this optional application-defined MaxTessFactor, the Tessellator always performs some additional basic clamping and rounding of Final TessFactors as appropriate for the situation, described later(11.7.11). Those manipulations guarantee the hardware behavior by limiting the range of possible inputs. The only exception to that well defined hardware interface is this MaxTessFactor declaration, which must rely on HLSL to generate code to enforce it. The reason it is the responsibility of HLSL to enforce consistency in this one case is that it was too late in the spec process to arrive at any consistent hardware definition here, either by defining what the hardware behavior is if MaxTessFactor is not enforced but then exceeded at runtime, or by getting all hardware vendors to enforce the same MaxTessFactor clamping in hardware.


    11 Tessellator


    Chapter Contents

    (back to top)

    11.1 Tessellation Introduction
    11.2 Tessellation Pipeline
    11.3 Input Assembler and Tessellation
    11.4 Tessellation Stages
    11.5 Fixed Function Tessellator
    11.6 IsoLines
    11.7 Tessellation Pattern
    11.8 Enabling Tessellation


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    11.1 Tessellation Introduction

    The tessellation model processes a patch at a time, either a quad, tri or "isoline" domain, and does not embody any specific surface representation. It strictly generates domain locations that are fed to a programmable shader (Domain Shader(12)) that is responsible for generating positions and any ancillary shading information (texture coordinates, tangent frames, normals, etc.). The domain locations are watertight across a boundary if an identical level of detail is used; otherwise the hardware plays no role in ensuring crack-free surfaces. This specification does not cover any specific surface representation, or how to map representations to the given pipeline.

    11.2 Tessellation Pipeline

    Requirements

    See the D3D pipeline(2) diagram to see how Tessellation (Hull Shader(10), Tessellator(11) and Domain Shader(12)) fits in.

    11.3 Input Assembler and Tessellation

    The Input Assembler(8) has a new primitive topology called "patch list", which is accompanied by a vertex count per patch: [1..32]. This is also described under Patch Topologies(8.11).

    All existing IA behaviors work orthogonally with patches, e.g. indexing, instancing, DrawAuto, etc.

    Incomplete patches are discarded – for example if the vertex count is 32 per patch, and a Draw call specifies 63 vertices, one 32 vertex patch will be produced, and the remaining 31 vertices will be discarded.

    11.4 Tessellation Stages

    Here are pointers to the stages involved in Tessellation, in the order of data flow:

    Hull Shader(10)

    Fixed Function Tessellator(11.5) (this chapter, below)

    Domain Shader(12)

    11.5 Fixed Function Tessellator

    This fixed function stage takes floating point TessFactor values as input and generates a tessellation of the domain. The domain can be tri, quad or isoLine (see next section for a definition of isoLines).

    The tessellator generates a couple of things: the domain point locations (U/V or U/V/W coordinates(23.10)) at which the Domain Shader will be invoked, and the topology (points, lines or triangles) connecting those points.

    Note the domains are defined such that for isoLines and quads, the V direction is clockwise from the U direction. For tri domain, UVW are clockwise, in that order.

    Adjacency(8.15) information is not available when using the tessellator - only independent points, lines or triangles are generated. The order that points/lines/triangles and their vertices are produced must be invariant between similar tessellator invocations on the same device, but no explicit order is prescribed.

    11.6 IsoLines

    The isoLine domain is a specialized form of the quad domain. It is the only domain that can produce tessellated lines. For isoLines, the U direction over a quad domain is the direction tessellated lines are drawn (lines of constant V). There are two TessFactor(10.10.4) values:

    The first is the line density, which is always rounded to an integer and determines how many U-parallel tessellated lines to generate across the V direction. The spacing of these lines across V is uniform, starting at V=0. So if the line density is 1, a single tessellated line is generated from (U=0,V=0) to (U=1,V=0). If the line density is 2, the first tessellated line is generated from (0,0) to (1,0) and the second tessellated line is generated from (0,0.5) to (1,0.5). Notice that no line is ever generated at V=1.

    The second TessFactor is the line detail, determining how much to tessellate each line of constant V.
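
    As a small illustration of the density behavior (D here stands for the integer-rounded line density; purely illustrative shader code):

        // Lines of constant V are placed at V = i / D, for i = 0..D-1; never at V == 1.
        for (uint i = 0; i < D; ++i)
        {
            float v = (float)i / (float)D; // D == 2 gives lines at V = 0 and V = 0.5
            // ... each such line is then tessellated along U per the line detail TessFactor
        }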

    For more concrete info on the tessellation pattern for isolines see IsoLine Pattern Details(11.7.8).


    11.7 Tessellation Pattern


    Section Contents

    (back to chapter)

    11.7.1 Overview
    11.7.2 Tessellation Pattern Overview
    11.7.3 Fractional Partitioning

    11.7.3.1 Fractional Odd Partitioning
    11.7.3.2 Fractional Even Partitioning
    11.7.4 Splitting Vertices on an Edge
    11.7.5 Which Vertices to Split
    11.7.6 Triangulation
    11.7.6.1 Transitions
    11.7.6.2 Triangulating Picture Frame Sides
    11.7.7 Integer Partitioning
    11.7.7.1 Pow2 Partitioning
    11.7.8 IsoLine Pattern Details
    11.7.9 Primitive Ordering
    11.7.9.1 Tessellator PrimitiveID
    11.7.10 TessFactor Interpretation
    11.7.11 TessFactor Range
    11.7.11.1 HS MaxTessFactor Declaration
    11.7.11.2 Hardware Edge TessFactor Range Clamping
    11.7.11.3 Hardware Inside TessFactor Range Clamping
    11.7.12 Culling Patches
    11.7.13 Tessellation Parameterization and Watertightness
    11.7.14 Tessellation Precision
    11.7.15 Tessellator State Specified Via Hull Shader Declarations


    11.7.1 Overview

    Details of the point placement and connectivity are described in words in this section.

    A more concrete description can be found in the reference fixed function tessellator code, entirely encapsulated in the following C++ files:

    tessellator.hpp(outside link)

    tessellator.cpp(outside link)

    11.7.2 Tessellation Pattern Overview

    The inside of a triangle/quad patch is a tessellated triangle/square based on a specified InsideTessFactor(s). For a triangle, there is a single TessFactor(10.10.2) for the inside region of the patch. For a quadrilateral, there are 2 inside TessFactors(10.10.3).

    HLSL exposes helpers that can optionally derive inside TessFactors from the edge TessFactors (these amount to shader code, so the hardware doesn't need to know about them). For example in the case of a quad patch, the helpers have a couple of options for deriving inside TessFactors – 1-axis and 2-axis. In the 1-axis mode, the inside TessFactor reduction is applied on all 4 edges producing a single inside TessFactor. In the 2-axis mode, the reduction from 4 edge TessFactors is divided into two separate parts. The V==0 and V==1 edge TessFactors are reduced to a single TessFactor for the V direction of the interior. Similarly the U==0 and U==1 TessFactors are reduced to a single TessFactor for the U direction on the interior.
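
    As a hedged sketch, the 1-axis flavor with an "average" reduction amounts to ordinary shader arithmetic like the following (edgeTF/insideTF are hypothetical names; the real HLSL helpers such as Process2DQuadTessFactorsAvg also perform cleanup and rounding):

        // Minimal sketch of a 1-axis average inside TessFactor reduction for a quad patch.
        float inside = 0.25 * (edgeTF[0] + edgeTF[1] + edgeTF[2] + edgeTF[3]);
        insideTF[0] = insideTF[1] = inside; // the same value is used for both interior axes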

    The boundaries of the patch transition between the inside TessFactor(s) and each per-edge TessFactor.

    There are two basic flavors of fractional tessellation: either using an even number of segments (intervals) on an edge or an odd number. When using an even number of segments, the coarsest an edge can be is two segments, so it is impossible to represent a level of detail with a single segment.

    For integer partitioning, TessFactors are rounded to integer. The parity (even/odd) of each edge and inside TessFactor after rounding determines how that area is tessellated: an odd integer TessFactor matches odd fractional tessellation at the same TessFactor. Similarly, an even integer TessFactor matches even fractional tessellation at the same TessFactor.

    For pow2 partitioning, TessFactors are rounded to a power of 2, and tessellation of pow2 TessFactors matches even fractional tessellation at the same TessFactor, but in addition the power of 2 mode can go down to 1 segment on any side (1 is a power of 2). From the hardware point of view there is no distinction between pow2 and integer - the hardware doesn't do the rounding of the TessFactors to pow2. That rounding is the responsibility of the HLSL compiler, given the shader being authored using the appropriate helper intrinsics in shader code (not discussed here).
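
    A minimal sketch of what such rounding could look like as plain shader code (a hypothetical helper, not the actual compiler output):

        float RoundUpToPow2(float tf) // assumes tf is already in [1..64]
        {
            return exp2(ceil(log2(tf)));
        }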

    11.7.3 Fractional Partitioning

    11.7.3.1 Fractional Odd Partitioning

    11.7.3.2 Fractional Even Partitioning

    11.7.4 Splitting Vertices on an Edge

    11.7.5 Which Vertices to Split

    Why Split Like This?

    11.7.6 Triangulation

    11.7.6.1 Transitions

    Mapping Vertices to Texels 1:1 in an Application

    Tri vs Quad Density Comparison

    11.7.6.2 Triangulating Picture Frame Sides

    11.7.7 Integer Partitioning

    11.7.7.1 Pow2 Partitioning

    Example: Displacement Mapping

    11.7.8 IsoLine Pattern Details

    11.7.9 Primitive Ordering

    The order that geometry is generated for a patch must be repeatable on a device, however no particular ordering of the geometry within a patch is prescribed. A strict requirement is that all geometry for a given patch flows down the pipeline before any geometry for subsequent patches.

    Suppose the rasterizer is the next active stage in the pipeline after tessellation, and there are vertex attributes that are declared in the Pixel Shader with constant interpolation. The leading vertex, used to provide the constant attribute for any individual line or triangle, can be any of the vertices in the line or triangle (albeit repeatable for a given patch and tessellator configuration on a device).

    11.7.9.1 Tessellator PrimitiveID

    When a patch topology is used, PrimitiveID(8.17) identifies which patch in the Draw*() call is being processed, starting from the Hull Shader onward. Even though tessellation may produce multiple points/lines/triangles, for a given patch, all of the primitives generated for it have the same PrimitiveID. As such, the freedom of point/line/triangle ordering within a patch is not visible to shader code. When a patch topology is used, the true "primitive" is the patch itself.

    11.7.10 TessFactor Interpretation

    The TessFactor number space roughly corresponds to how many line segments there are on the corresponding edge. This isn’t a precise definition of the number of segments because different tessellation modes snap to different numbers of segments (i.e. integer versus fractional_even versus fractional_odd).

    For integer partitioning, TessFactor range is [1 ... 64] (fractions rounded up).

    For pow2 partitioning, TessFactor range is [1,2,4,8,16,32,64]. Anything outside or in between values in this set is rounded to the next entry in the set by HLSL code... so from the hardware point of view, pow2 partitioning technically isn't different from integer partitioning.

    For fractional odd partitioning, TessFactor range is [1 ... 63]. Odd TessFactors produce uniform partitioning of the space. Other TessFactors in the range produce a segment count that is the next odd TessFactor higher, transitioning the point locations based on the distance between the nearest lower odd TessFactor and nearest greater odd TessFactor.

    For fractional even tessellation, TessFactor range is [2 ... 64]. Even TessFactors produce uniform partitioning of the space. Other TessFactors in the range produce a segment count that is the next even TessFactor higher, transitioning the point locations based on the distance between the nearest lower even TessFactor and nearest greater even TessFactor.

    For the IsoLine domain, the line detail TessFactor honors all the above modes. However the line density TessFactor always behaves as integer – [1 ... 64] (fractions rounded up to the next integer).
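
    The snapping of a TessFactor to a segment count per mode can be sketched as follows (illustrative only; the mode constants are hypothetical names, and the fractional modes' transitional point placement is a separate matter):

        static const uint INTEGER = 0, FRAC_ODD = 1, FRAC_EVEN = 2;

        float SegmentCount(float tf, uint mode) // tf assumed already range-clamped per mode
        {
            if (mode == INTEGER)  return ceil(tf);                           // next integer >= tf
            if (mode == FRAC_ODD) return 2.0 * ceil((tf - 1.0) * 0.5) + 1.0; // next odd >= tf
            return 2.0 * ceil(tf * 0.5);                                     // next even >= tf
        }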

    11.7.11 TessFactor Range

    11.7.11.1 HS MaxTessFactor Declaration

    This particular clamp on TessFactors is discussed here(10.13), and is independent of the hardware clamps defined in the rest of this section.

    11.7.11.2 Hardware Edge TessFactor Range Clamping

    The following describes the float32 patch edge TessFactor range that the hardware Tessellator must accept from the Hull Shader.

    First of all, if any edge TessFactor is <= 0 or NaN, the patch is culled.

    Otherwise, hardware must clamp each edge input TessFactor to the range specified below.

    Partitioning Min Edge TessFactor Max Edge TessFactor Comments
    Even_Fractional 2 64
    Odd_Fractional 1 63
    Integer (Pow2 maps to integer in hardware) 1 64 After clamping, round result to next integer.

    For IsoLines, the LineDensity Tessfactor (which is how many constant V iso-lines to draw) is clamped by the hardware to [1...64] and rounded to the next integer.
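
    A hedged restatement of the table above as code (the partitioning constants are hypothetical names):

        static const uint EVEN_FRACTIONAL = 0, ODD_FRACTIONAL = 1, INTEGER_OR_POW2 = 2;

        // Returns false if the patch is culled; otherwise clamps tf in place.
        bool ProcessEdgeTessFactor(inout float tf, uint partitioning)
        {
            if (!(tf > 0.0f)) return false; // catches tf <= 0 as well as NaN
            if (partitioning == EVEN_FRACTIONAL)     tf = clamp(tf, 2.0f, 64.0f);
            else if (partitioning == ODD_FRACTIONAL) tf = clamp(tf, 1.0f, 63.0f);
            else                                     tf = ceil(clamp(tf, 1.0f, 64.0f));
            return true;
        }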

    11.7.11.3 Hardware Inside TessFactor Range Clamping

    In addition to patch edge TessFactors, hardware will be given inside TessFactors from the Hull Shader. There are two inside TessFactors for quad patches (U and V axes), and one inside TessFactor for tri patches.

    These HS outputs may have been derived (optionally) from the edge TessFactors via some operation such as max or avg in Hull Shader code autogenerated by HLSL. This derivation may involve low TessFactor fixups to prevent popping as TessFactors transition through extreme cases. Such processing is just shader code, irrelevant to the hardware.

    For the final inside TessFactors coming out of the Hull Shader, the following is pseudocode for the validation the hardware must perform, effectively creating safe bounds on the complexity of the cases the hardware tessellation algorithm has to handle.

    // Compute HWInsideTessFactorU/V for a quad patch (the similar tri patch case has only
    // one axis), given HSOutputInsideTessFactorU/V + 4 edge TessFactors (edgeTessFactor[0..3]).
    // This is just the fixed function hardware processing, independent of shader pre-conditioning
    // of the TessFactors (which the hardware does not need to know about).
        float lowerBound, upperBound;
        switch(partitioning)
        {
            case integer:
            case pow2: // don’t care about pow2 distinction for validation, just treat as integer
                lowerBound = 1;
                upperBound = 64;
                break;
    
            case even_fractional:
                lowerBound = 2;
                upperBound = 64;
                break;
    
            case odd_fractional:
        #define EPSILON 0.0000152587890625f // 2^(-16), min positive fixed point fraction
                if( (edgeTessFactor[0] > 1.0f + EPSILON/2) ||
                    (edgeTessFactor[1] > 1.0f + EPSILON/2) ||
                    (edgeTessFactor[2] > 1.0f + EPSILON/2) ||
                    (edgeTessFactor[3] > 1.0f + EPSILON/2) ||
                    (HSOutputInsideTessFactorU > 1.0f + EPSILON/2) ||
                    (HSOutputInsideTessFactorV > 1.0f + EPSILON/2) )
                {
                    // If any TessFactor, edge or inside, will be > 1 after rounding during
                    // the float to fixed point conversion that happens later,
                    // then make all inside TessFactors > 1.
                    lowerBound = 1.0f + EPSILON;
                }
                else // all are <= 1.0f or NaN (NaN compares false above)
                {
                    lowerBound = 1;
                }
                upperBound = 63;
                break;
        }
    
        HWInsideTessFactorU = min( upperBound, max( lowerBound, HSOutputInsideTessFactorU ) );
        HWInsideTessFactorV = min( upperBound, max( lowerBound, HSOutputInsideTessFactorV ) );
        // A tri patch only has one insideTessFactor instead of U/V.
        // Note the above clamps map NaN to lowerBound, based on D3D/IEEE754R min/max rules.
    
        if( (partitioning == integer) || (partitioning == pow2) )
        {
            // Round up to the next integer (don’t care about the pow2 distinction for validation).
            HWInsideTessFactorU = ceil(HWInsideTessFactorU);
            HWInsideTessFactorV = ceil(HWInsideTessFactorV);
            // A tri patch only has one insideTessFactor instead of U/V.
        }
    
        // After this, all TessFactors are converted to .16 fixed point using D3D float->fixed
        // conversion rules(3.2.4.1) (incl round-to-nearest-even). Topology and domain coordinate
        // placement is done based on the fixed point TessFactors.
    

    11.7.12 Culling Patches

    If any of the edge TessFactors from the HS for a patch are <= 0 or NaN, the patch is culled. No Domain Shader invocations or anything later in the pipeline are produced for that patch.

    A discussion elsewhere about enabling and disabling(11.8) of tessellation discusses how patch culling interacts with tessellation disabled, but patches being streamed out to memory.

    11.7.13 Tessellation Parameterization and Watertightness

    A shared edge has to generate identical domain locations for crack free tessellation to be possible. Domain Shader authors are responsible for achieving this, given some guarantees from the hardware. First, hardware tessellation on any given edge must always produce a distribution of domain points symmetric about the edge based on the TessFactor for that edge alone. Second, the parameterization of each domain point (U/V for quad or U/V/W for tri) must produce “clean” values in the space [0.0 ... 1.0]. “Clean” means that given a domain point on one side of the edge, with the parameter for that edge (say it is U) in [0 ... 0.5], the mirrored domain point produced on the other side, call it U' in [0.5 ... 1.0] will have a complement satisfying (1-U') == U exactly.

    Even if a neighboring patch sharing an edge happens to produce a complementary parameterization (U moving in the other direction, and/or U/V swapped), both sides’ parameterization for each shared edge domain point will be equivalent because they are clean.

    Having clean parameterization means that DS authors can write domain point evaluation algorithms with a carefully constructed order of operations that is guaranteed to produce the same result even if the control points for the patch are traversed in reverse order and/or with the parameter space complemented.

    11.7.14 Tessellation Precision

    Tessellator input float32 TessFactor values are immediately converted to fixed point. Note this happens after any float processing of TessFactors, such as Inside TessFactor derivation, has been done by HLSL-generated shader code in the HS Patch Constant Fork or Join Phases. Once the final TessFactors have been converted to fixed point, all remaining tessellator arithmetic (computing domain locations) is performed using fixed point arithmetic with 16 bits of fraction. The last step in domain point coordinate calculation is to convert the coordinates back to float32 for input to the Domain Shader.

    The fact that output U/V/W domain coordinates(23.10) have been quantized to 16 bit fixed point means there is a uniform spacing of representable values across the [0...1] range. This uniform spacing facilitates the symmetry and watertightness issues discussed above.

    Due to the fixed point arithmetic involved, it is possible for the tessellator to produce degenerate lines or triangles, where each vertex has identical domain coordinates. This will not be visible if the primitives are sent to the rasterizer, because they will be culled. However, if the Geometry Shader and/or Stream Output are enabled, the degenerate primitives will appear, and it is the application’s responsibility to be robust to this. For example, Geometry Shader code could check for and discard degenerates if that turns out to be the only way to keep the algorithm being used from falling over on the degenerate input.
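
    A minimal sketch of such a filter in Geometry Shader code (the struct and the 'uv' field carrying domain coordinates are assumptions for illustration):

        struct DSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; };

        [maxvertexcount(3)]
        void FilterGS(triangle DSOut v[3], inout TriangleStream<DSOut> s)
        {
            // Any two coincident domain points make the triangle degenerate.
            if (all(v[0].uv == v[1].uv) || all(v[1].uv == v[2].uv) || all(v[0].uv == v[2].uv))
                return;
            s.Append(v[0]); s.Append(v[1]); s.Append(v[2]);
        }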

    If the Tessellator’s output primitive is points (as opposed to triangles or lines), only unique points within a patch are required to be generated. The one exception is that points on the threshold of merging (if TessFactors were to incrementally decrease) may appear in the system as duplicated points (with the same U/V coordinates) in an implementation dependent way.

    What does 16-bit fixed point math for the domain coordinate generation mean?

    Suppose a single patch is drawn 64 meters wide.

    With 16 fraction bits there are 2^16 = 65536 uniform steps across the [0...1] domain, so there is enough precision to place points at roughly 1 mm resolution (64 m / 65536 ≈ 0.98 mm).

    11.7.15 Tessellator State Specified Via Hull Shader Declarations


    11.8 Enabling Tessellation


    Section Contents

    (back to chapter)

    11.8.1 Final D3D11 Definition for Enabling Tessellation

    11.8.1.1 Sending Un-Tessellated Patches to the Geometry Shader
    11.8.1.2 Sending Un-Tessellated Patches to NULL GS + Stream Output
    11.8.1.3 Sending Un-Tessellated Patches to the Rasterizer


    11.8.1 Final D3D11 Definition for Enabling Tessellation

    The presence of both a Hull Shader and Domain Shader enables tessellation. When a Hull Shader and Domain Shader are bound, the Input Assembler topology is required to be a patch type (otherwise behavior is undefined). If a Hull Shader is bound and no Domain Shader is bound, or vice versa, the behavior is undefined.

    Patches can be used at the Input Assembler without tessellation (no Hull Shader or Domain Shader), as long as the Geometry Shader and/or Stream Output are being used.

    11.8.1.1 Sending Un-Tessellated Patches to the Geometry Shader

    When tessellation is disabled (no Hull Shader and no Domain Shader bound), patches arriving at the Geometry Shader cause the GS to be invoked once per patch. Each GS invocation sees all the Control Points of the patch as an array of input vertices.

    Allowing the GS to be invoked with patches allows it to effectively input non-traditional topologies (beyond points, lines, triangles). E.g. to invoke the GS with a cube as its input primitive, one could send 8 Control Point patches.
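
    A hedged sketch of the cube example in HLSL (assuming the HLSL InputPatch type as the GS input for patch topologies; all names are illustrative):

        struct CP  { float4 pos : POSITION; };
        struct Out { float4 pos : SV_Position; };

        [maxvertexcount(8)]
        void CubeGS(InputPatch<CP, 8> cube, inout PointStream<Out> pts)
        {
            for (uint i = 0; i < 8; ++i) // a single invocation sees all 8 corners
            {
                Out o;
                o.pos = cube[i].pos;
                pts.Append(o);
            }
        }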

    The GS does not support output of patches. The output of the GS remains one of: point list, line strips or triangle strips.

    11.8.1.2 Sending Un-Tessellated Patches to NULL GS + Stream Output

    Sending un-tessellated patches to NULL GS + Stream Output is valid. This enables, for example, Control Points that have gone through the Vertex Shader to be streamed out for multi-pass or reuse scenarios. Note, however, it is not possible for Hull Shader outputs to be streamed out (or go into the GS) - the presence of the Hull Shader requires a simultaneous Domain Shader and enables Tessellation – both of which consume Hull Shader output entirely.

    When un-tessellated patches arrive at Stream Output, each Control Point in the patch appears as a single vertex for Stream Output. This definition is similar to the way NULL GS + Stream Output behaves with traditional primitive topologies such as triangle lists. As with other primitive types, only complete patches get written out; if there is not enough room to store a complete patch, it is discarded.

    11.8.1.3 Sending Un-Tessellated Patches to the Rasterizer

    It could have been defined that Control Points arriving at the rasterizer are interpreted as points and rasterized as such, but that would have required a RenderTarget-space projected "position" to be present in the control points, and the application would have to have wanted to draw them as points. This is an extremely unlikely scenario, not worth targeting. Therefore, if an un-Tessellated patch arrives at the Rasterizer, behavior is undefined and the debug runtime will call this out as an error.

    Original Definition for Enabling Tessellation

    The behaviors described so far in this section are the result of making cutbacks from the originally defined behavior. The cutbacks were made due to concerns over how the design was unfriendly to certain choices of D3D11 hardware implementations, resulting in among other issues unreasonable hardware and driver complexity.

    The original behavior is documented below for the sake of history, formatted like this. It is a superset of the final behavior above, so a lot of the content appears the same. Briefly, the most interesting extra bit of functionality was being able to pass Hull Shader outputs to GS/StreamOutput without tessellation. Tessellation was enabled only by the presence of a Domain Shader (which then required a Hull Shader). Without a Domain Shader, tessellation was disabled, but the Hull Shader could still be present, outputting control points downstream.

    Enabling Tessellation (this crossed out text is no longer representative of D3D11)

    The presence of a Domain Shader enables tessellation. When a Domain Shader is bound, the Input Assembler topology is required to be a patch type, and a Hull Shader must also be bound, otherwise the behavior is undefined (debug error).

    The absence of a Domain Shader disables tessellation. The Input Assembler topology is still allowed to be a patch type when tessellation is disabled. The following subsections describe what this means.

    Sending Un-Tessellated Patches to the Geometry Shader

    When tessellation is disabled, patches arriving at the Geometry Shader (with or without a Hull Shader Present) cause the GS to be invoked once per patch. Each GS invocation sees all the Control Points of the patch as an array of input vertices. Patch Constant data from the Hull Shader, such as Tessellation Factors, are not visible to the GS.

    Allowing the GS to be invoked with patches allows it to effectively input non-traditional topologies (beyond points, lines, triangles). E.g. to invoke the GS with a cube as its input primitive, one could send 8 Control Point patches.

    Sending Un-Tessellated Patches to Null GS + Stream Output

    Sending Un-Tessellated Patches to NULL GS + Stream Output is valid. This enables, for example, Control Points that have gone through the Vertex Shader and/or Hull Shader to be streamed out for multi-pass or reuse scenarios.

    Each Control Point in the patch appears as a single vertex for Stream Output. This definition is similar to the way NULL GS + Stream Output behaves with traditional primitive topologies such as triangle lists. As with other primitive types, only complete patches get written out; if there is not enough room to store a complete patch, it is discarded.

    If the HS is active, that means the HS output Control Points can be streamed out. Without the HS active, the VS output for each Control Point in a patch can be streamed out.

    Patch Constant data output by the Hull Shader, such as Tessellation Factors, are not available to Stream Output. As a workaround, an application that needs to stream out Patch Constant data could set up the tessellator to run, but then have the Domain Shader flag all but the first n domain points for the patch for discarding (such as by assigning a bad vertex position). The n domain points (where n is chosen to fit all the Patch Constant data across n vertices’ storage) would save out all the patch data from the Domain Shader. The GS/Stream Output could then send the data to memory as a sequence of individual points.

    If the HS culls a patch (by specifying an edge Tessellation factor <= 0) when tessellation is disabled, the "cull" has no effect on Stream Output of the patch. This choice was made because it is deemed not worth defining that the Stream Output stage must be able to interpret some Patch Constant data (TessFactors) to make a decision about what to stream out. Thus if un-tessellated patches are being sent to Stream Output, there is no way to cull them.

    Sending Un-Tessellated Patches to the Rasterizer

    It could have been defined that control points arriving at the rasterizer are interpreted as points and rasterized as such, but that would have required a RenderTarget-space projected "position" to be present in the control points, and the application would have to have wanted to draw them as points. This is an extremely unlikely scenario, not worth targeting. Therefore, if an un-Tessellated patch arrives at the Rasterizer, behavior is undefined and the debug runtime will call this out as an error.


    12 Domain Shader Stage

    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    For a Tessellation overview, see the Tessellator(11) section.

    12.1 Domain Shader Instruction Set

    The Domain Shader instruction set is listed here(22.1.5).

    12.2 Domain Shader Contents

    Inputs for this stage are the 2D or 3D domain location(23.10) generated by the tessellator(11) and all of the data generated by the Hull Shader(10). This latter data is visible to all domain points in a patch. In all other ways this shader is effectively analogous to a Vertex Shader(9).

    12.2.1 Domain Shader Invocation

    The Domain Shader can see all the data output by both phases of the Hull Shader, as well as the domain location of a particular point. The Domain Shader is invoked for every domain location generated by the Tessellator.

    12.2.2 Domain Shader Registers

    The following registers are available in the ds_5_0 model.

    Register Type Count r/w Dimension Indexable by r# Defaults Requires DCL
    32-bit Temp (r#) 4096 (r# + x#[n]) r/w 4 N None Y
    32-bit indexable Temp Array (x#[n]) 4096 (r# + x#[n]) r/w 4 Y None Y
    32-bit Input Control Points (vcp[vertex][element]) 32, see (1) below r 4(component)*32(element)*32(vert) Y None Y
    32-bit Input Patch Constants (vpc[element]) 32, see (1) below r 4 Y None Y
    32-bit input location in domain (vDomain(23.10).xy or vDomain(23.10).xyz) 1 r 3 N n/a Y
    32-bit UINT Input PrimitiveID (vPrim) 1 r 1 N n/a Y
    Element in an input resource (t#) 128 r 128 Y None Y
    Sampler (s#) 16 r 1 Y None Y
    ConstantBuffer reference (cb#[index]) 15 r 4 Y None Y
    Immediate ConstantBuffer reference (icb[index]) 1 r 4 Y(contents) None Y
    Output Registers:
    32-bit output Vertex Data Element (o#) 32 w 4 Y None Y

    (1) The domain shader sees the Hull Shader outputs in 2 separate sets of registers. The vcp registers can see all of the Hull Shader’s output Control Points. The vpc registers can see all of the Hull Shader’s Patch Constant output data.

    Since code for the Hull Shader Patch Constant Fork or Join Phases outputs TessFactors using names such as SV_TessFactor, the DS must match those declarations on the equivalent vpc input if it wishes to see those values.
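
    A hedged sketch of a matching quad-domain Domain Shader in HLSL (a hypothetical bilinear evaluation; OutputPatch and SV_DomainLocation are the HLSL-level views of the vcp and vDomain registers):

        struct CP     { float3 pos : POSITION; };
        struct QuadTF { float Edge[4] : SV_TessFactor; float Inside[2] : SV_InsideTessFactor; };
        struct DSOut  { float4 pos : SV_Position; };

        [domain("quad")]
        DSOut MainDS(QuadTF tf, const OutputPatch<CP, 4> p, float2 uv : SV_DomainLocation)
        {
            float3 a = lerp(p[0].pos, p[1].pos, uv.x);
            float3 b = lerp(p[3].pos, p[2].pos, uv.x);
            DSOut o;
            o.pos = float4(lerp(a, b, uv.y), 1.0); // bilinear evaluation of the patch
            return o;
        }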

    12.2.3 System Generated Values in the Domain Shader

    InstanceID(8.18) and VertexID(8.16) can be input as long as the Hull Shader outputs these values (per-Control Point).

    The domain location is another System Generated Value, appearing in its own input register (vDomain(23.10)).

    The final set of System Values is the various TessFactors produced by the Hull Shader, discussed elsewhere(10.10). These can be declared as inputs out of part of the input Patch Constant (vpc) registers.


    13 Geometry Shader Stage


    Chapter Contents

    (back to top)

    13.1 Geometry Shader Instruction Set
    13.2 Geometry Shader Invocation and Inputs
    13.3 Geometry Shader Output
    13.4 Geometry Shader Output Data
    13.5 Geometry Shader Output Streams
    13.6 Geometry Shader Output Limitations
    13.7 Partially Completed Primitives
    13.8 Maintaining Order of Operations Geometry Shader Code
    13.9 Registers
    13.10 Geometry Shader Input Register Layout


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    13.1 Geometry Shader Instruction Set

    The Geometry Shader instruction set is listed here(22.1.6).

    13.2 Geometry Shader Invocation and Inputs

    When a Geometry Shader is active, it is invoked once for every primitive passed down or generated earlier in the Pipeline. Each invocation of the Geometry Shader sees as input the data for the invoking primitive, whether that is a single point, a single line, a single triangle, or the Control Points for a Patch (if a Patch arrives with Tessellation disabled). A triangle strip from earlier in the Pipeline would result in an invocation of the Geometry Shader for each individual triangle in the strip (as if the strip were expanded out into a triangle list). All the input data for each vertex in the individual primitive is available (i.e. 3 vertices for a triangle), plus adjacent vertex data if applicable/available. All vertex inputs/Element-layout/adjacency to be read must be declared, and this declaration must be compatible with the data being produced above in the Pipeline. Other inputs include textures, and also PrimitiveID as a 32-bit scalar integer input.

    13.2.1 Geometry Shader Instancing

    An alternate method of invoking the Geometry Shader is via instancing. A GS Instancing declaration(22.3.7) specifies a (fixed) number of times for the GS to be invoked for each primitive. Each instance that executes is identified by a GS instance ID value [0...n-1], and the outputs of each GS instance are appended to the end of the outputs of the previous invocation (with an implicit cut of the topology between instances - see the description of cutting further below). The maximum instance count that can be declared is 32, but for a full explanation of the constraints of GS instancing, see the description of the GS Instancing declaration(22.3.7).

    Some background: The D3D10 Geometry Shader had a limit on the amount of vertex data that a single shader invocation can emit. The limit is 1024 scalars of data (fatter vertices mean fewer vertices can be emitted). The shader program must statically declare the maximum number of vertices it intends to output. It was desirable to relax this limit in some fashion.

    Another limitation of the D3D10 Geometry Shader design was that the way the GS emits vertices is implicitly serial. E.g. if a GS program wants to project an input triangle onto 6 cube faces, it must project to each cube face and emit geometry for each face one at a time. It was desirable to have a way for a GS program to be authored to explicitly reveal to the hardware when the calculations that produce different batches of geometry from the same GS program are independent of each other. This way, hardware can execute each batch of vertex generation in parallel.

    13.2.1.1 Effect on GSInvocations Counter

    The GSInvocations Pipeline Statistics counter(20.4.7) reports the number of primitives input to the GS multiplied by the instance count per primitive. That is, each "instance" counts as a GSInvocation.

    13.3 Geometry Shader Output

    The Geometry Shader outputs data one vertex at a time using the "emit"(22.8.3) command. The topology of these vertices is determined by a fixed declaration(22.3.8), choosing one of: pointlist, linestrip, or trianglestrip as the output for the GS. Strips can be restarted by using the "cut"(22.8.1) command, which ends the current strip at the last emitted vertex, so that the next emitted vertex begins a new strip. The "emitThenCut"(22.8.5) instruction both emits a vertex, and stops the current strip on this vertex, so that the next vertex that is emitted begins a new strip. For pointlist output, "cut" has no effect (including the "cut" part of "emitThenCut").

    The outputs of a given invocation of the Geometry Shader are independent of other invocations (though ordering(4.2) is respected). A Geometry Shader emitting triangle strips will start a new strip on every invocation. In addition, as mentioned above, an invocation of the Geometry Shader can produce multiple separate strips using "cut"s.
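
    At the HLSL level, "emit" and "cut" surface as stream-object methods; a minimal sketch (types illustrative):

        struct V { float4 pos : SV_Position; };

        [maxvertexcount(6)]
        void GS(triangle V v[3], inout TriangleStream<V> s)
        {
            s.Append(v[0]); s.Append(v[1]); s.Append(v[2]); // three emits: one triangle
            s.RestartStrip();                               // cut: end the current strip
            s.Append(v[0]); s.Append(v[2]); s.Append(v[1]); // a second, separate strip
        }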

    The Geometry Shader must declare the maximum number of vertices an invocation of the Shader will output. The total amount of data that a Geometry Shader invocation can produce is 1024 32-bit values. The calculation of the Stream Output record size with one or more streams is as follows: given that each stream declares its outputs in its own clean-slate view of the full output register set, the total output record size is the number of scalars in the union of all the stream declarations. This size multiplied by the max output vertex count must not exceed 1024. When Geometry Shader instancing is used, the Stream Output record size restriction applies to each instance individually.

    With only a single output stream, the above rule matches D3D10.
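
    For example (illustrative sizing only): with an 8-scalar output vertex, the declared maximum vertex count can be at most 128, since 8 * 128 = 1024:

        struct VSOut { float4 pos : POSITION; };
        struct GSOut { float4 pos : SV_Position; float4 color : COLOR0; }; // 8 scalars per vertex

        [maxvertexcount(128)] // 8 scalars * 128 vertices = 1024, exactly at the limit
        void GS(point VSOut p[1], inout PointStream<GSOut> s)
        {
            // ... may emit up to 128 vertices
        }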

    The limit on Geometry Shader output is based on how many "emit" calls the Shader makes. The limit on Geometry Shader output is not affected in any way by the size of the output buffer(s) that are present or whether or not they have even been bound. Even if no output Buffers happen to be bound to a Stream and a vertex is output (and therefore dropped), it still counts against the limit.

    Hardware must enforce the limit above by stopping writes if the Geometry Shader program continues after emitting the declared maximum number of vertices.

    See the documentation of the GS maximum output vertex count declaration(22.3.5), as well as the GS Instancing declaration(22.3.7) for more details.

    13.4 Geometry Shader Output Data

    The o# registers to be written by the Geometry Shader must be declared (e.g. "dcl_output o[3].xyz"). The set of these declarations defines which registers are read when an "emit"(22.8.3) command is issued, defining a vertex. Therefore, all vertices emitted by the Geometry Shader have the same data layout.

    When a Geometry Shader output is identified as a System Interpreted Value(4.4.5) (e.g. "renderTargetArrayIndex" or "position"), hardware looks at this data and performs some behavior dependent on the value, in addition to being able to pass the data itself to the next Shader stage for input. When such data output from the Geometry Shader has meaning to the hardware on a per-primitive basis (such as "renderTargetArrayIndex" or "ViewportArrayIndex"), rather than on a per-vertex basis (such as "clipDistance" or "position"), the per-primitive data is taken from the Leading Vertex(8.14) emitted for the primitive.

    Each time an "emit"(22.8.3) or "emitThenCut"(22.8.5) is issued the contents of the declared Geometry Shader output registers are read to produce a vertex, and in addition the Geometry Shader outputs immediately become uninitialized. In other words, if any output data needs to be repeated for consecutive vertices, the Geometry Shader program must write the data over again to the output registers for each vertex.

    The Geometry Shader outputs have a close relationship to the Stream Output Stage/functionality, described here(14.3).

    13.5 Geometry Shader Output Streams

    13.5.1 Streams vs Buffers

    STREAM: For the discussion here, let us define a stream as a sequence of writes of a structure of data out of a shader. A Geometry Shader can output up to 4 streams, each at different rates, with independent data going to each stream. The utility of this is in conjunction with Stream Output(14).

    BUFFER: For the discussion in this section, in the context of Stream Output(14), a Buffer is a resource in memory that can receive any subset of the data from one stream. A stream can have its data split out (not replicated) across multiple buffers, and this mapping is defined by a Stream Output declaration (which is not visible in the Geometry Shader code). A Buffer cannot receive data from multiple streams at once.

    13.5.2 Multiple Output Streams

    Up to 4 streams can be declared(22.3.9) by the GS. Without the GS present, all vertex data is a single stream.

    When the GS defines multiple streams, variants of the "emit"(22.8.3), cut(22.8.1) or "emitThenCut"(22.8.5) instructions which take an immediate stream # [0..4-1] parameter must be used by the GS to indicate which stream is being output. These instructions are "emit_stream"(22.8.4), cut_stream(22.8.2) and "emitThenCut_stream"(22.8.6), respectively.

    From the point of view of the Geometry Shader, all the declarations of its output registers appear multiple times independently, once per stream. A statement appears in the bytecode setting the current output stream being declared, and subsequent declarations of output registers define what data gets latched when vertex data is emitted to each stream. The set of output registers available to the GS program during execution is the union of all output registers declared for each stream (individual streams can use the same output registers). When a vertex is emitted to a given stream, only the output registers declared for that stream feed the output to the stream; however, ALL declared output registers for all streams become uninitialized.

    If output register indexing is declared(22.3.30), specifying a range of output registers that can be dynamically indexed, the register space that can be declared for indexing is the union of all stream output register declarations.

    When outputting to multiple streams, the GS output topology declaration(22.3.8) must appear for each stream, and must be set to "point". In other words, multiple streams means that non-point output is unavailable.
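
    A hedged sketch of a two-stream GS at the HLSL level (each stream-object parameter corresponds to a declared stream, and Append on a given object corresponds to "emit_stream" to that stream; all type names are illustrative):

        struct VIn { float4 pos : POSITION; };
        struct A   { float4 pos : SV_Position; };
        struct B   { float2 data : TEXCOORD0; };

        [maxvertexcount(2)]
        void GS(point VIn v[1], inout PointStream<A> s0, inout PointStream<B> s1)
        {
            A a; a.pos  = v[0].pos;    s0.Append(a); // emit_stream to stream 0
            B b; b.data = v[0].pos.xy; s1.Append(b); // emit_stream to stream 1
        }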

    The points-only limitation with multi-stream output was a hardware limitation during the design. Perhaps in future DX releases this can be relaxed - that is to allow arbitrary topologies in each stream. An example would be to output triangles to one stream that goes to the rasterizer, while sending points to another stream that goes to Stream Output at a different frequency for compiling a list of coordinates to revisit with some postprocessing later. Or to render some triangles while saving off rejected ones.

    When outputting to only a single stream, the output from the GS can be a point list, line strip or triangle strip (strips are expanded to lists when streamed to memory), or a patch list. Output of a patch list from the GS is only valid for Stream Output, not for rasterization (undefined behavior).

    When outputting to multiple streams, one of them can be sent to the rasterizer (independently of whether it is also streaming to memory). The Stream Output declaration specifies this (outside the shader code, but appearing to the driver side by side). Interpolation modes, System Interpreted Values and System Generated Values can be declared on any combination of Streams in the Shader, but the only ones that have any meaning are the ones corresponding to the Stream (if any) declared (outside the shader) as going to the rasterizer (if any). For Streams that are not going to the Rasterizer, the names are ignored. Notice that the same shader could be created with different Stream Output declarations, each time selecting a different Stream to go to Rasterization.

    If a GS with streams is passed to CreateGeometryShader at the API/DDI (meaning there is no Stream Output declaration or rasterizer stream selection), the active stream defaults to 0. So stream 0 goes to rasterization if rasterization is enabled, and the absence of a Stream Output declaration means nothing is streamed out to memory. If the stream selected to go to rasterization isn’t declared in the GS or doesn’t include a position and rasterization is enabled, behavior is undefined, just as with any shader that feeds the rasterizer without a position.

    Sending one of the streams to rasterization with multiple streams isn't a particularly interesting feature for now, since in the multi-stream case all streams are point lists.

    Interpolation modes declared for the outputs on one Stream don’t have to match those on another Stream. Note that when the Geometry Shader is created, a choice of which stream (if any) is going to rasterization is made, so the driver shader compiler needs to pay attention to interpolation modes and System Interpreted Values (such as "position") on at most a single Stream’s declarations.

    13.6 Geometry Shader Output Limitations

    When the application knows that some GS outputs will be treated as per-primitive constant at the subsequent Pixel Shader, the Geometry Shader need only initialize such output registers when they represent the Leading Vertex(8.14) for a primitive. For example, on the last 2 vertices in a triangle strip, outputs that (on Leading Vertices) would have been treated as constant by the Pixel Shader need not be written. If Stream Output is being used, which has no knowledge of what data is per-primitive constant or not, in the expansion of GS output strips to lists, Stream Output simply dumps out all the declared outputs for each vertex for each primitive. If the GS chooses to not write out what it knows is non-Leading-Vertex data for Elements that will be used to drive per-primitive constants in a later pass, uninitialized data gets written to these unwritten Elements in Stream Output. This is fine as long as the application never attempts to later read such uninitialized Stream Output data. If the application later recirculates the Streamed Out data in a way that correctly interprets only per-primitive constant data at Leading Vertices and never interprets the uninitialized data at non-Leading-Vertices (even though it does get read back into the pipeline), no undefined behavior results.

    There is a mechanism to retrieve the number of output primitives in the output buffer. Further details regarding writing to memory from the Geometry Shader are described elsewhere in the spec.(14)

    13.7 Partially Completed Primitives

    Partially completed primitives could be generated by the Geometry Shader if the Geometry Shader ends and the primitive is incomplete. Incomplete primitives are silently discarded and no counters are incremented. This is similar to the way the IA treats Partially Completed Primitives(8.13).

    13.8 Maintaining Order of Operations Geometry Shader Code

    To ensure consistent order of operations on an edge and primitive level for primitives that show up in multiple invocations of the Geometry Shader (as an adjacent primitive in some invocations, or the root primitive for one invocation), it is up to the application to write Shader code that traverses vertices in a consistent manner. This ordering can be obtained by a variety of methods, including simply sorting of vertices based on position in Shader code. A more robust ordering can be achieved by providing a vertex "coloring" (a number) as vertex attribute, such that for any primitive, the coloring is guaranteed to be unique for each vertex in the primitive. This method has the benefit that the sorting operation in the Geometry Shader is more efficient (and robust) than sorting xyz vertex positions. Colorings can be generated offline by an authoring tool.
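
    A hedged sketch of the coloring-based ordering (names hypothetical):

        struct Vtx { float4 pos : POSITION; uint color : COLOR0; };

        #define SWAP(a, b) { Vtx t = a; a = b; b = t; }
        void SortByColoring(inout Vtx v0, inout Vtx v1, inout Vtx v2)
        {
            // The coloring is unique per vertex within any one primitive, so this
            // ordering is identical in every GS invocation that sees a given edge.
            if (v1.color < v0.color) SWAP(v0, v1);
            if (v2.color < v1.color) SWAP(v1, v2);
            if (v1.color < v0.color) SWAP(v0, v1);
        }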

    13.9 Registers

    The following registers are available in the gs_5_0 model:

    Register Type                                    Count              r/w  Dimension         Indexable by r#  Defaults  Requires DCL
    -----------------------------------------------  -----------------  ---  ----------------  ---------------  --------  ------------
    32-bit Temp (r#)                                 4096 (r# + x#[n])  r/w  4                 n                none      y
    32-bit Indexable Temp Array (x#[n])              4096 (r# + x#[n])  r/w  4                 y                none      y
    32-bit Input (v[vertex][element])                32                 r    4(comp)*32(vert)  y                none      y
    32-bit Input Primitive ID (vPrim)                1                  r    1                 n                none      y
    32-bit Input Instance ID (vInstanceID)           1                  r    1                 n                none      y
    Element in an input resource (t#)                128                r    1                 n                none      y
    Sampler (s#)                                     16                 r    1                 n                none      y
    ConstantBuffer reference (cb#[index])            15                 r    4                 y(contents)      none      y
    Immediate ConstantBuffer reference (icb[index])  1                  r    4                 y(contents)      none      y

    Output Registers:
    NULL (discard result, useful for ops
        with multiple results)                       n/a                w    n/a               n/a              n/a       n
    32-bit output Vertex Data Element (o#)           32                 w    4                 n/a              none      y

    13.10 Geometry Shader Input Register Layout

    The Geometry Shader must declare which type of primitive it expects as input, out of the set of choices: {point,line,triangle,line_adj,triangle_adj,1-32 control point patch list}. The input primitive type specifies the number of vertices that are present, and the vertices are always fully indexed (there is no declaration for vertex indexing range). Even if strips are being used earlier in the Pipeline, individual primitives cause Geometry Shader Invocations. See the GS Input Primitive Declaration Statement(22.3.6) in the instruction reference.

    The following diagrams depict the layout of Geometry Shader Input Primitives into the input v# registers:

    [Diagrams not reproduced here: one layout per input primitive type, mapping each vertex of the primitive to the v[vertex][element] registers.]
    14 Stream Output Stage


    Chapter Contents

    (back to top)

    14.1 Mapping Streams to Buffers
    14.2 Stream Output Buffer Declarations/Bindings
    14.3 Stream Output Declaration Details
    14.4 Current Stream Output Location
    14.5 Tracking Amount of Data Streamed Out
    14.6 Stream Output Buffer Bind Rules
    14.7 Stream Output Is Orthogonal to Rasterization


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    The Pipeline can stream vertices out to memory just before clipping and rasterization (even if rasterization is still enabled). Vertices are always written out as complete primitives (e.g. 3 vertices at a time for triangles); incomplete primitives are never written out.

    Just before Streaming Out, all topologies are always expanded to lists (i.e. if the topology is a triangle strip, it is expanded to a triangle list, having 3 vertices per primitive).

    If the Geometry Shader is active, it is capable of producing outputs with up to 32 Elements per-vertex (each Element up to 4 components) for the Rasterizer, any subset of which can be routed to Stream Output. The presence of the GS allows multiple streams to be generated as well, as described here(13.5).

    If the Geometry Shader is not active, whatever data arrives at the point in the pipeline where Stream Output appears (just before clipping and rasterization) can be Streamed Out (after expansion to a list topology as described above). Topologies with adjacency discard the "adjacent" vertices and only Stream Out the "interior" vertices. Patch topologies arriving at Stream Output can only go to Stream Output; the rasterizer must be disabled (undefined behavior otherwise).

    In the expansion of strips to lists of primitives on Stream Output from the Geometry Shader, there is no notion of any data being able to be treated as "constant"; for each Geometry Shader output primitive (after expansion from a strip to a list), the vertices each originate from separate "emit"(22.8.3) instructions. Applications can still take advantage of this behavior to store primitive data, simply by relying on the fact that if streamed out geometry is recirculated back into the Pipeline in another pass, the Rasterizer will treat the Leading Vertex(8.14) in each primitive as the source for attributes that are declared as constant by the Pixel Shader.

    14.1 Mapping Streams to Buffers

    A description of the distinction between a Stream and Buffer is given here(13.5). Up to 4 Streams can be present when the GS is used, otherwise there is a single Stream, Stream 0.

    Stream Output can send data from any Stream to up to 4 Buffers simultaneously. The total number of output Buffers across all Streams is also constrained to 4. Data from multiple Streams cannot go to a single Buffer, but each Stream can send its output to multiple Buffers. Stream data cannot be replicated across multiple buffers.

    Up to 128 scalar components of data per-vertex can be streamed out across the output Buffers, as long as the total window of data being output per-vertex to any one Buffer is 512 bytes or less. Vertex stride to a given Buffer can be up to 2048 bytes.

    14.2 Stream Output Buffer Declarations/Bindings

    The mapping of data from Streams to where they are written in output Buffers appears in a declaration outlined further below.

    14.2.1 Stream Output Formats

    In all cases, the only supported output data formats at Stream Output are 32-bit per component integer and floating point formats, with 1 to 4 components. This is not as general as the other Resource input/output paths in the D3D11.3 Pipeline. See the "Stream Output" column in the formats(19.1) table to see which formats can be used for Stream Output (all of which can of course be used at other parts of the D3D11.3 Pipeline for input). When any given 32-bit component of data in the Pipeline goes out the Stream Output path and gets written to memory, the hardware must simply write the 32 bits of data (per component) out unaltered, which is consistent with the sorts of formats supported for Stream Output described here.

    14.3 Stream Output Declaration Details

    The selection of which Elements to send to the Stream Output is tied to the Geometry Shader. When a Geometry Shader program is "Created" on the D3D11.3 Device, additional parameters can be passed into the "Create" call alongside the Geometry Shader code, describing (a) what subset of data from the GS output to send to Stream Output for each of 1 to 4 Streams, (b) where to write the data to memory, and (c) a selection of 0 or 1 of the output Streams as going to the Rasterizer (independent of whether that Stream is also going to Stream Output). If the Geometry Shader is not needed, but Stream Output functionality is desired, a "NULL" GS program can be specified, along with a Stream Output declaration for Stream 0 only, in which case whatever geometry reaches the GS stage of the pipeline gets Streamed Out.

    The vertices in one Stream reaching the point in the pipeline just before the Rasterizer/clipping can be sent both to the Rasterizer (if the Pixel Shader is active) as well as to Stream Output if it is active, simultaneously. The Pixel Shader can consume any subset of the data reaching it, while Stream Output can simultaneously select any other (possibly overlapping) subset of the data.

    The "NULL" GS + Stream Output scenario enables operations such as Streaming out the results of a VS. An application might wish to apply skinning to a vertex Buffer and save the results for reuse multiple times later. This may be accomplished by configuring a pipeline with a VS and a NULL GS (which just describes Stream Output). The vertex Buffer can be traversed by drawing a pointlist, in which case the VS will be invoked once for each vertex where skinning would be done, and then the Stream Output description can dump the result out to memory.
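
    As a rough, non-normative sketch of this scenario at the API level (the corresponding DDI is described below), the vertex shader's bytecode itself is handed to CreateGeometryShaderWithStreamOutput(); pVSBytecode/bytecodeLen are placeholders for the application's compiled VS, and the declaration assumes the VS outputs only a float4 position:

    #include <d3d11.h>

    // Illustrative only: configure a "NULL GS" that streams the VS output
    // position to output Buffer slot 0.
    HRESULT CreateStreamOutVS(ID3D11Device* dev,
                              const void* pVSBytecode, SIZE_T bytecodeLen,
                              ID3D11GeometryShader** ppSOGS)
    {
        // Stream 0: components xyzw of "SV_Position" -> output Buffer slot 0.
        D3D11_SO_DECLARATION_ENTRY decl =
            { 0, "SV_Position", 0, 0, 4, 0 }; // Stream, semantic, index, start, count, slot

        UINT stride = 4 * sizeof(float); // bytes per vertex written to slot 0

        return dev->CreateGeometryShaderWithStreamOutput(
            pVSBytecode, bytecodeLen,
            &decl, 1,                      // one declaration entry
            &stride, 1,                    // one Buffer stride
            D3D11_SO_NO_RASTERIZED_STREAM, // nothing goes to the Rasterizer
            nullptr, ppSOGS);
    }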

    The CreateGeometryShaderWithStreamOutput() DDI is defined roughly as follows (exact details will vary; IHVs should defer to the reference codebase). The API differs in a few ways from this DDI, such as hiding the concept of "registers" and "masks" appearing below, instead using string names for elements in a shader output signature, and component counts / offsets to identify data within elements.

    typedef struct D3D11DDIARG_CREATEGEOMETRYSHADERWITHSTREAMOUTPUT
    {
        CONST DWORD*                                        pShaderCode;
        CONST D3D11DDIARG_STREAM_OUTPUT_STREAM*             pStreams;
        UINT                                                NumStreams;
        CONST UINT*                                         pBufferStrideInBytes;
        UINT                                                NumStrides;
    } D3D11DDIARG_CREATEGEOMETRYSHADERWITHSTREAMOUTPUT;
    
    
        pShaderCode             - The GS program.  This can be NULL, which
                                  means there is no GS, but stream output is
                                  being defined (NumStreams must be > 0).
    
        NumStreams              - How many Streams are being defined [0... 4]
                                  When set to 0, Stream Output is not being used
                                  (pShaderCode MUST have a GS in this case).
                                  A nonzero value defines the size of the Stream
                                  declaration array, pStreams.
    
        pBufferStrideInBytes    - Array giving, for each output Buffer, the spacing
                                  between the beginnings of consecutive vertices
                                  during stream output.
                                  The stride value must be >= the declared size of the stream
                                  output structure (including gaps), up to 2048 bytes max.
                                  Any amount in excess of the size of the stream output
                                  structure is untouched in memory during stream output.
    
        NumStrides              - How many Buffers are being defined [0... 4]
    
    
    typedef struct D3D11DDIARG_STREAM_OUTPUT_STREAM
    {
        CONST D3D10DDIARG_STREAM_OUTPUT_DECLARATION_ENTRY*  pOutputStreamDecl;
        UINT                                                NumEntries;
        BOOL                                                StreamToRasterizer;
    } D3D11DDIARG_STREAM_OUTPUT_STREAM;
    
        NumEntries              - Indicates how many entries are in the array at
                                  pOutputStreamDecl.  This must be > 0, and defines
                                  how many Elements (including gaps between Elements
                                  in memory that aren’t touched) are being defined for Stream
                                  Output, per-vertex.  Maximum count is 128 per Stream,
                                  with up to 4 Streams supported.
    
        pOutputStreamDecl     - Array of NumEntries instances of the
                                structure defined below.  This array defines a
                                contiguous sequence of up to 128 32-bit
                                components of memory to get written per-vertex during
                                Stream Output.  Each declaration entry defines up to
                                4 components that either (a) come from
                                one GS output register, or (b) are skipped (gap in
                                output).  Consecutive declaration entries define output
                                memory contiguous to the previous entry.
    
        StreamToRasterizer    - Whether this Stream is going to the Rasterizer.
                                Only one stream can have this set to true.  It is valid
                                for no stream to set this true.  If a Stream is going
                                to the Rasterizer, it can also be sent to Stream Output
                                as well (which is what pOutputStreamDecl above defines,
                                independently).
    
    
        typedef struct D3D10DDIARG_STREAM_OUTPUT_DECLARATION_ENTRY
        {
            UINT OutputSlot; // Which output buffer (slot) this is going out to.
                             // OutputSlot can only be [0..3].
            UINT RegisterIndex; // This specifies which GS register to take output from.
                               // The same register can appear multiple times in
                               // the declaration (and do not have to appear
                               // consecutively in the declaration), as long as the
                               // RegisterMask does not overlap for repeated registers
                               // within a Stream.  Separate streams can overlap
                               // output registers and component masks freely.
                               // If there’s no GS, RegisterIndex refers to the
                               // appropriate "register" from the previous active
                               // Pipeline Stage's output.
                               // There is no limit on the total number of unique
                               // registers that can be referenced (e.g. all 32 GS
                               // output registers can be referenced), as long
                               // as the amount of data doesn't exceed 128 32-bit
                               // values.
                               // A special RegisterIndex, 0xffffffff, represents
                               // a gap in stream output.  In this case, no data
                               // from the pipeline is written out; instead the
                               // components specified by RegisterMask are skipped in
                               // the output (and the output memory is unchanged).
                               // The only valid RegisterMask values for gaps
                               // are .x, .xy, .xyz or .xyzw, representing
                               // gaps of 1, 2, 3 or 4 components, respectively.
                               // Larger gaps are defined by chaining together
                               // smaller gaps (at least at the DDI).
            DWORD RegisterMask;// Mask (i.e. xyzw mask) to apply to this “register”
                               // coming from the Pipeline.  This must be a subset of
                               // the mask for the “register” in the source Pipeline
                               // Stage’s output, and cannot have gaps between
                               // components.  To define gaps between components,
                               // such as writing .xw, separate declaration
                               // entries are used, e.g. for .xw, an entry for
                               // .x, an entry for the gap, and an entry for .w.
                               //
                               // The width of the mask defines how far the
                               // Stream Output location advances.  For example, if
                               // the mask is .yzw, Stream Output writes three 32-bit
                               // values (y, z and w) and advances by 3 components.
                               // To accomplish complex layouts, such as swapping
                               // component order or interleaving components from
                               // multiple registers, and having gaps, multiple
                               // declaration entries are used (allowing
                               // Stream Output to be defined a component at a time).
                               //
                               // See RegisterIndex above for special behavior when
                               // the register is set to 0xffffffff (gaps).
                               //
                               // RegisterMask cannot be empty.
                               //
                               // ------
                               //
                               // Example scenario for RegisterMask:
                               // Suppose - RegisterIndex is 10, and
                               //         - the GS declares o10.yzw for output.
                               //
                               // In this case, RegisterMask would be allowed only to be
                               // the following, where (#) indicates how far in
                               // multiples of 32 bits the stream output location
                               // advances:
                               // .y (1), .z (1), .w (1), .yz (2), .zw (2), .yzw (3).
        } D3D10DDIARG_STREAM_OUTPUT_DECLARATION_ENTRY;
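
    As a non-normative example of chaining declaration entries, the entries below would write o3.x, skip two components of Buffer memory, then write o3.w, all to output slot 0 (assuming the declaration structure above is in scope). The mask constants are hypothetical stand-ins for whatever component-mask encoding the DDI actually uses:

    // Hypothetical component-mask encodings (one bit per component).
    const DWORD MASK_X  = 0x1;
    const DWORD MASK_XY = 0x3;
    const DWORD MASK_W  = 0x8;

    D3D10DDIARG_STREAM_OUTPUT_DECLARATION_ENTRY entries[] =
    {
        { 0, 3,          MASK_X  }, // write o3.x
        { 0, 0xffffffff, MASK_XY }, // 2-component gap; memory untouched
        { 0, 3,          MASK_W  }, // write o3.w
    };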
    

    14.3.1 Summary of Using Stream Output

    In order to use Stream Output, the application must:

    (a) Create a Geometry Shader (possibly "NULL") with a Stream Output declaration(14.3) describing the per-vertex data layout for each Stream.

    (b) Bind output Buffer(s) created with the Stream Output Bind Flag(5.3.4), along with starting offsets(14.4).

    (c) Issue Draw() calls; complete primitives reaching Stream Output are appended to the bound Buffer(s).

    Below is a very rough example (using pseudocode) of the sequence of operations an application might perform and how to calculate vertex counts.

    What the Shader wants to do:
    Suppose the GS needs to output:
          float2 A
          int4 B
          float3 C
          float3 D

    The shader needs {A, B} to be output at one frequency as a point list.

    {C, D} are to be output at another frequency as a point list.

    A needs to go to buffer 0.

    B needs to go to buffer 1.

    A and B both need to go to the rasterizer as well.

    C and D need to go to buffer 2.

    The shader needs to output up to 100 of {A,B} and up to 70 of {C,D}, worst case 170 (100+70) emits total.

    How this is accomplished by the application (basically by declaring exactly what is needed):
    The Geometry Shader declares A and B into one stream (say stream 0), so emits of the data to stream 0 are done via emit(0). HLSL declares in the shader IL that A goes to o0.xy, B goes to o1.xyzw.

    C and D are declared into another stream (stream 1), so emits to stream 1 are done via emit(1). HLSL declares in the shader IL that C goes to o0.xyz and D goes to o1.xyz.

    The CreateGeometryShaderWithStreamOutput() call tags Stream 0 as going to the rasterizer.

    Stream 0 and Stream 1 are declared as a point list topology (in fact whenever producing multiple streams, the only available topology is point list for each of them).

    Vertices can be emitted to either stream in any order.

    The shader code doesn’t need to know anything about the mapping of A,B,C,D to buffers/formats/memory layout. As in D3D10, the buffer output declaration that accompanies the shader at CreateGeometryShaderWithStreamOutput is responsible for those assignments and format definitions. This API validates stream constraints, like enforcing that outputs declared in different streams in the shader cannot be sent to the same buffer. In contrast, what this example does is valid: parts of a single output stream are split across multiple buffers.

    The GS output declaration declares the max output vertex count as 170. As a result, shader compilation fails for this example! The reason is that the output vertex record size, based on the output declarations for the 2 streams, is the union of the declarations of each. Since stream 0 defines o0.xy and o1.xyzw, and stream 1 defines o0.xyz and o1.xyz, the union is {o0.xyz,o1.xyzw} = 7 scalars. 7 * 170 vertices = 1190, which is greater than 1024. If it happened that stream 1 also declared o0.xy and o1.xyzw (same as stream 0), the record size would have been 6 scalars, and 6*170 = 1020 which would have been valid.
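
    The record-size arithmetic above can be checked mechanically. A minimal, non-normative C++ sketch (the masks simply mirror the declarations in this example):

    #include <bit>     // std::popcount (C++20)
    #include <cstdio>

    int main()
    {
        // Component masks (bit i set = component i declared) for this example.
        unsigned stream0_o0 = 0x3; // o0.xy   (A: float2)
        unsigned stream0_o1 = 0xF; // o1.xyzw (B: int4)
        unsigned stream1_o0 = 0x7; // o0.xyz  (C: float3)
        unsigned stream1_o1 = 0x7; // o1.xyz  (D: float3)

        // The output record size is the union of all streams' declarations.
        unsigned scalars = std::popcount(stream0_o0 | stream1_o0)  // xyz  = 3
                         + std::popcount(stream0_o1 | stream1_o1); // xyzw = 4

        unsigned maxVertexCount = 170;
        std::printf("%u scalars * %u vertices = %u (limit 1024): %s\n",
                    scalars, maxVertexCount, scalars * maxVertexCount,
                    scalars * maxVertexCount <= 1024 ? "valid" : "invalid");
    }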

    14.4 Current Stream Output Location

    Buffers used for Stream Output need to have a way to keep track of how full they are, in order to support appending and potentially to allow DrawAuto(8.9) to be invoked without the CPU knowing how full the Buffer is at that time. See the Stream Output Pipeline Bind Flag for Buffers(5.3.4). This value is referred to as the BufferFilledSize. When the Buffer is newly created, the BufferFilledSize must equal 0.

    In addition to structure definition (or type declaration for single Element Buffers) there is a mechanism for defining the starting offset into the Buffers where Shader outputs will start to be written. This offset is equivalent/equal to the BufferFilledSize associated with each Stream Output Buffer, since defining the starting offset also redefines the BufferFilledSize value. The next Draw() calls will begin streaming output data to the Buffer, starting at the offset, effectively appending data to the Buffer and accumulating the BufferFilledSize value associated with the Buffer. Subsequent Draw() calls continue to append to the location after the previous Draw() call finished. This is as if the starting offset were implicitly moved forward at the end of each Draw() call. The starting offset can also simply be reset to any location in the Buffer, overriding the implicit advancement after Draw() calls, and redefining the BufferFilledSize.

    When setting the Stream Output Buffer and starting Buffer offset, a reserved value for the starting Buffer offset (e.g. -1) is used to indicate that the BufferFilledSize of the Buffer should be used as the starting Buffer offset. This allows a Stream Output Buffer to be appended to even if the Buffer is unbound from the Pipeline and bound back again later. So, these two call patterns would be identical:

    SetStreamOutput( pBuffer, 0 ); // Buffer, & starting offset.
    Draw(); // appends Stream Output & increases pBuffer's BufferFilledSize.
    Draw(); // appends Stream Output & increases pBuffer's BufferFilledSize.
    Draw(); // appends Stream Output & increases pBuffer's BufferFilledSize.
    
    SetStreamOutput( pBuffer, 0 ); // Buffer, & starting offset.
    Draw(); // appends Stream Output & increases pBuffer's BufferFilledSize.
    SetStreamOutput( pBuffer, -1 ); // Buffer, & starting offset = pBuffer's BufferFilledSize
    Draw(); // appends Stream Output & increases pBuffer's BufferFilledSize.
    SetStreamOutput( pBuffer, -1 ); // Buffer, & starting offset = pBuffer's BufferFilledSize
    Draw(); // appends Stream Output & increases pBuffer's BufferFilledSize.
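
    At the D3D11 API, the analogous binding call is SOSetTargets(), where a starting offset of -1 likewise requests appending at the Buffer's current BufferFilledSize. A non-normative sketch (ctx, pBuffer and vertexCount are assumed to exist, with pBuffer created using the Stream Output Bind Flag):

    UINT zero = 0;
    ctx->SOSetTargets(1, &pBuffer, &zero);   // start writing at offset 0
    ctx->Draw(vertexCount, 0);

    UINT append = (UINT)-1;                  // -1: resume at BufferFilledSize
    ctx->SOSetTargets(1, &pBuffer, &append);
    ctx->Draw(vertexCount, 0);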
    

    14.5 Tracking Amount of Data Streamed Out

    In order to monitor how much data the Pipeline has streamed out, there are some asynchronous queries: SO_STATISTICS(20.4.9) and SO_OVERFLOW_PREDICATE(20.4.10). In short, SO_STATISTICS provides a mechanism to retrieve values from two hardware counters for each Stream:
    (a) UINT64 NumPrimitivesWritten = the number of primitives written to a Stream
    (b) UINT64 PrimitiveStorageNeeded = the total number of primitives that would have been written given sufficient storage for the Buffer(s) in a Stream.
    Since the raw values of hardware counters are rarely useful on their own, these counters are typically sampled twice and the difference between the two samples is taken. The NumPrimitivesWritten difference and PrimitiveStorageNeeded difference will not be equal if the Draw() call(s), which were invoked between the two hardware counter sample points, generate more primitives than there is space left in the smallest of the currently bound Buffer(s) to store them. Note there is only one NumPrimitivesWritten counter per Stream even though it is possible to have multiple simultaneous Buffers bound for writing by a Stream. Stream Output is defined to stop all writes to a Stream if one of the Buffers being written by that stream does not have room for another complete primitive.

    The hardware always writes as many complete primitives (e.g. 3 vertices for a triangle) as possible to the Buffer(s) for a Stream; a given primitive is written only if there is enough space for its entire contents (e.g. 3 times the vertex stride for triangles must be available in the Buffer) in all the output Buffers for the Stream. If any Buffer for a Stream becomes full before the Draw() call has completed (i.e. no more space for a complete primitive to be appended), Shader execution continues, along with sustained incrementing of the PrimitiveStorageNeeded counter for that Stream, but not the NumPrimitivesWritten counter for that Stream. In addition, the Shader's outputs are no longer written to any of the output Buffers for that Stream. Output to other Streams functions independently.

    An application can detect the overflow condition with the SO_OVERFLOW_PREDICATE(20.4.10). In particular, there are 4 + 1 predicates, one for each Stream, and an additional predicate that indicates if any of the 4 Streams has overflowed. These predicates can be used to mask future graphics commands to, for example, prevent a corrupted frame from being shown to the application. This could be useful when streaming unpredictable amounts of data out from the Geometry Shader.
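
    A non-normative API-level sketch of sampling these counters around a Draw() call (per-Stream query variants such as D3D11_QUERY_SO_STATISTICS_STREAM0 follow the same pattern):

    #include <d3d11.h>

    // Illustrative only: sample SO_STATISTICS around one Draw() and report
    // whether the Stream's bound Buffer(s) overflowed.
    bool DrawAndCheckSOOverflow(ID3D11Device* dev, ID3D11DeviceContext* ctx,
                                UINT vertexCount)
    {
        D3D11_QUERY_DESC qd = { D3D11_QUERY_SO_STATISTICS, 0 };
        ID3D11Query* query = nullptr;
        dev->CreateQuery(&qd, &query);

        ctx->Begin(query);
        ctx->Draw(vertexCount, 0);   // Stream Output happens during this Draw()
        ctx->End(query);

        D3D11_QUERY_DATA_SO_STATISTICS stats = {};
        while (ctx->GetData(query, &stats, sizeof(stats), 0) == S_FALSE)
            ; // spin; a real application would poll rather than busy-wait
        query->Release();

        // If storage ran out mid-Draw, fewer primitives were written than needed.
        return stats.PrimitivesStorageNeeded > stats.NumPrimitivesWritten;
    }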

    If multiple Buffers are being written by a given Stream, as soon as one of the Buffers can no longer hold any more complete primitives, writes to ALL Buffers for that Stream are stopped, while as mentioned above, Shader execution continues, and the PrimitiveStorageNeeded counter continues to tally for that Stream. Other Streams operate independently.

    14.6 Stream Output Buffer Bind Rules

    If an output buffer slot (0..3) has data streamed out to it (as indicated by the stream output declaration), but no buffer is attached, then that output buffer slot is treated as if a full buffer is attached, resulting in the overflow behavior described here(14.5).

    If an output buffer slot does not have data being streamed out to it, and a buffer is attached, then that buffer is fully ignored, including having no impact on overflow and output tracking.

    14.7 Stream Output Is Orthogonal to Rasterization

    The path through Rasterizer output is always available, even if Stream Output is active. When the Stream Output declaration is provided (at Create time), the application indicates which output Stream (if any) is enabled for Rasterization. This is covered in the DDI here(14.3).


    15 Rasterizer Stage


    Chapter Contents

    (back to top)

    15.1 Rasterizer State
    15.2 Disabling Rasterization
    15.3 Always Active: Clipping, Perspective Divide, Viewport Scale
    15.4 Clipping
    15.5 Perspective divide
    15.6 Viewport
    15.7 Scissor Test
    15.8 Viewport and Scissor Controls
    15.9 Viewport/Scissor State
    15.10 Depth Bias
    15.11 Cull State
    15.12 IsFrontFace
    15.13 Fill Modes
    15.14 State Interaction With Point/Line/Triangle Rasterization Behavior
    15.15 Per-Primitive RenderTarget Array Slice Selection
    15.16 Rasterizer Precision
    15.17 Conservative Rasterization
    15.18 Axis-Aligned Quad Rasterization


    Summary of Changes in this Chapter from D3D10 to D3D11.3

    Back to all D3D10 to D3D11.3 changes.(25.2)

    A Rasterizer overview is here(2.8). Many fundamental basics of Rasterizer operation are also provided in the Basics(3) section.

    Vertices (x,y,z,w) coming to the Rasterizer are assumed to be in homogeneous clip space. In this coordinate space the X axis points right, Y points up, and Z points away from the camera.

    15.1 Rasterizer State

    The meanings of the states are either self explanatory, or described further below.

    typedef struct D3D11_RASTERIZER_DESC1
    {
        D3D11_FILL_MODE         FillMode;               // described below
        D3D11_CULL_MODE         CullMode;               // described below
        BOOL                    FrontCounterClockwise;  // do CCW primitives count as front-facing (for culling)?
        UINT                    DepthBias;              // described below
        float                   SlopeScaledDepthBias;   // described below
        float                   DepthBiasClamp;         // described below
        BOOL                    DepthClipEnable;        // described below
        BOOL                    ScissorEnable;          // described below
        BOOL                    MultisampleEnable;      // see Line State(15.14.1) (the name Multisample is misleading; it affects lines only)
        BOOL                    AntialiasedLineEnable;  // see Line State(15.14.1)
        UINT                    ForcedSampleCount;      // see Target Independent Rasterization(3.5.6)
    } D3D11_RASTERIZER_DESC1;
    

    Rasterizer state is encapsulated in an object which, once created, cannot be edited. Up to 4096 such objects can be created on a given device context.

    The reason for the limit on the number of immutable Rasterizer State objects that can be created is to enable hardware to maintain references to multiple of these in flight in the Pipeline without having to track changes or flush the Pipeline, which would be necessary if rasterizer state were allowed to be edited.

    15.2 Disabling Rasterization

    Rasterization is disabled when the following are all true:

    (a) No Pixel Shader is bound.

    (b) Depth testing is disabled at the Output Merger.

    (c) Stencil testing is disabled at the Output Merger.

    15.3 Always Active: Clipping, Perspective Divide, Viewport Scale

    There is NO facility in D3D11 for disabling clipping of X and Y coordinates, the viewport scale, or the perspective divide if the rasterizer is enabled. Clipping of the Z coordinates can be disabled by setting the DepthClipEnable Rasterizer State(15.1) to FALSE.

    Note that this means there is no way for an application to directly pass RenderTarget-space coordinates for vertices. Vertex positions are always assumed to be in normalized space, so the Viewport transformation must always be relied upon to map to specific pixel locations.

    15.4 Clipping

    In clip space primitives are clipped to the following volume:

    0 < w
    -w <= x <= w (or arbitrarily wider range if implementation uses a guard band to reduce clipping burden)
    -w <= y <= w (or arbitrarily wider range if implementation uses a guard band to reduce clipping burden)
    0 <= z <= w

    By default primitives are clipped to a volume that includes a 0 <= z <= w depth range clip. Clipping of the Z coordinates can be disabled by setting the DepthClipEnable Rasterizer State(15.1) to FALSE. Primitives that fall outside of the depth range are thus still rendered, but are given the value of the nearest limit of the viewport depth range. Even when Z clipping is disabled, primitives must be clipped such that only w > 0 vertices result. Coordinates coming into clipping with infinities at x,y,z may or may not result in a discarded primitive. Coordinates with NaN at x,y,z or w coming out of clipping are discarded.

    The reason to allow disabling depth clip is that it causes problems for applications such as stencil shadows, necessitating complex code to draw end-caps on geometry that exceeds the depth range. When Z clipping is disabled, primitives may not be correctly depth-sorted at the pixel level, but this is unimportant for some applications (and can be dealt with via painter's algorithm).

    There are no restrictions to the range of input vertex coordinates to clipping. Clipping operations are performed using at least float32 precision, and accordingly NaNs and infinities are processed using the floating point rules.

    Two additional mechanisms for slicing geometry against application defined planes are provided, similar to each other in programming method but different in behavior:

    (a) A method for clipping primitives against a plane at the rasterization level (i.e. allowing for intersection within an individual primitive)

    (b) A method for culling primitives if all vertices are on the "out" side of a plane.

    These mechanisms, dubbed "Clip Distances" and "Cull Distances" respectively, are described below.

    15.4.1 Clip Distances

    To enable primitive setup / rasterizer to perform clipping against arbitrary planes defined by the application, vertex component(s) can be identified as the System Interpreted Value(4.4.5) "clipDistance". When component(s) of vertex Element(s) are identified this way, these values are each assumed to be a float32 signed distance to a plane. Primitive setup only invokes rasterization on pixels for which the interpolated plane distance(s) are >= 0.

    Multiple clip planes can be implemented simultaneously, by declaring multiple component(s) of one or more vertex elements as the System Interpreted Value "clipDistance".

    When multisampling, implementations MUST clip against clip distances at subsample resolution.

    If a vertex has a clip distance of NaN, the primitives containing that vertex are discarded.

    For further information about "clipDistance", see its listing(24.1) in the System Interpreted Values reference.
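
    As a non-normative illustration, the value an application feeds into a "clipDistance" component is just a float32 signed plane distance, e.g. a homogeneous dot product computed per vertex (the types and names below are illustrative):

    // Illustrative only: a float32 signed plane distance computed per vertex,
    // suitable for output as a "clipDistance" component.  The plane is
    // (a,b,c,d), with dot(plane, position) >= 0 meaning "visible".
    struct Float4 { float x, y, z, w; };

    float ClipDistance(const Float4& pos, const Float4& plane)
    {
        // The Rasterizer only invokes pixels where the interpolated
        // distance is >= 0.
        return plane.x * pos.x + plane.y * pos.y +
               plane.z * pos.z + plane.w * pos.w;
    }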

    15.4.2 Cull Distances

    To enable rough primitive-level culling against arbitrary planes defined by the application, vertex component(s) can be identified as the System Interpreted Value(4.4.5) "cullDistance". When component(s) of vertex Element(s) are given this label, these values are each assumed to be a float32 signed distance to a plane. Primitives will be completely discarded if the plane distance(s) for all of the vertices in the primitive are < 0. Said another way, if any of the plane distance(s) (data labeled as the System Interpreted Value "cullDistance") in a primitive is >= 0, the primitive is not culled (though other culling such as backface culling could still occur and is orthogonal).

    Multiple cull planes can be used simultaneously, by declaring multiple component(s) of one or more vertex elements as the System Interpreted Value "cullDistance".

    Since cullDistance culling can be done simply by looking at vertices, without having to enable a path in the Rasterizer for clipping within primitives, it can be more efficient (though coarser) than using clipDistances, which must be able to operate at rasterization level.

    If a vertex has a cull distance of NaN, that vertex counts as "out" (as if it is < 0).

    For further information about "cullDistance", see its listing(24.2) in the System Interpreted Values reference.

    15.4.3 Multiple Simultaneous Clip and/or Cull Distances

    At most 8 components in at most 2 vertex elements may be defined as System Interpreted Values "clipDistance" or "cullDistance".

    For a given primitive with one or multiple components labeled as System Interpreted Value "cullDistance", the rejection test (primitive rejected if all distances < 0) is applied using all vertices for each cullDistance component, and if the primitive is rejected by any one or more of the tests it is discarded.

    After cullDistance processing is complete, for remaining primitives going into rasterization setup, if there are one or multiple components labeled as System Interpreted Value "clipDistance", any region(s) of a primitive that result in one or more of the clipDistances being < 0 after interpolation are not rasterized.

    Inside the Pixel Shader it is valid to declare input Element(s) labeled as System Interpreted Values "clipDistance" and "cullDistance", in which case the appropriately interpolated clip distances or cull distances show up, as expected.

    The interpolation mode declared(22.3.10) by the Pixel Shader on any input v# register labeled as System Interpreted Value "clipDistance" must be D3DINTERPOLATION_LINEAR. No such limitation exists for input v# registers labeled as System Interpreted Value "cullDistance"; these can be interpolated any way into the Pixel Shader.

    Note that clip/cull distances have no effect on GS stream output if it is active. The clip/cull can be thought of as appearing after the stream output in the Pipeline.

    15.5 Perspective divide

    After clipping, position X,Y,Z coordinates and non-constant vertex attributes with interpolation mode linear (meaning with perspective), are divided by the position W value.

    15.6 Viewport

    Viewports map clip-space vertex positions into RenderTarget space. In RenderTarget space the Y axis points down, so Y coordinates are flipped during the viewport scale. Multiple Viewports can be made available simultaneously, so that each primitive can choose one (see Viewport Index(15.8.1)); however the basic case is to simply use a single Viewport for all rendering in a particular scene. Only one Viewport can ever apply to an individual primitive being rasterized.

    Viewport extents are specified as float32 values (including the Z extents). All operations using the extents are done with float32 arithmetic.

    There is always an implicit scissoring by the Viewport x/y extents, orthogonal to other Scissor(15.7) state. In other words, regardless of whether an implementation has a guard band in its clipper, rendering will never touch any area outside the Viewport's x/y extents (except a small nondeterministic region that appears if the viewport left and top extents have fractional coordinates, discussed in the Viewport Range(15.6.1) section).

    If a Viewport has not been set, then the default is a Viewport with all extents 0: {0,0,0,0,0.0f,0.0f}. When RenderTargets change, there is no automatic update of the Viewport.

    Viewport scale is performed using float32 arithmetic according to the following formulas:

    Xrt= (X + 1) * Viewport.Width * 0.5 + Viewport.TopLeftX
    Yrt= (1 - Y) * Viewport.Height * 0.5 + Viewport.TopLeftY
    Zrt= Viewport.MinDepth + Z * (Viewport.MaxDepth - Viewport.MinDepth)
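
    A direct, non-normative float32 transcription of these formulas (names are illustrative):

    struct Viewport { float TopLeftX, TopLeftY, Width, Height, MinDepth, MaxDepth; };

    // Inputs are post-perspective-divide normalized positions.
    void ViewportTransform(const Viewport& vp, float x, float y, float z,
                           float& xrt, float& yrt, float& zrt)
    {
        xrt = (x + 1.0f) * vp.Width  * 0.5f + vp.TopLeftX;
        yrt = (1.0f - y) * vp.Height * 0.5f + vp.TopLeftY; // Y flip: RT Y points down
        zrt = vp.MinDepth + z * (vp.MaxDepth - vp.MinDepth);
    }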

    An additional effect of the Viewport is that in the Output Merger, just before the final rounding of z to depth-buffer format before depth compare, the z value is always clamped: z = min(Viewport.MaxDepth,max(Viewport.MinDepth,z)), in compliance with D3D11 Floating Point Rules(3.1) for min and max. This clamping occurs regardless of where z came from: out of interpolation, or from z output by the Pixel Shader (replacing the interpolated value). Z input to the Pixel Shader is not clamped (since the clamp described here occurs after the Pixel Shader).

    D3D11 may need to expose a 'cap' bit indicating whether an implementation clamps shader z input or not.

    15.6.1 Viewport Range

    Viewport MinDepth and MaxDepth must both be in the range [0.0f...1.0f], and MinDepth must be less-than or equal-to MaxDepth.

    The Rasterizer must support(15.16) fixed-point x,y positions after Viewport scale with 16.8 precision (approximately [-32768..32767] range). As such D3D11 defines the following constraints on the float Viewport Width, Height, TopLeftX and TopLeftY parameters:

    -32768 <= Viewport.TopLeftX <= 32767

    -32768 <= Viewport.Width + Viewport.TopLeftX <= 32767

    -32768 <= Viewport.TopLeftY <= 32767

    -32768 <= Viewport.Height + Viewport.TopLeftY <= 32767

    Viewport parameters are validated in the runtime such that values outside these ranges will never be passed to the DDI.

    In D3D10/D3D10.1, the Viewport extents at the API were integer, but they were changed to floating point to enable fractional scrolling of viewports and to enable emulating the D3D9 coordinate system easily by using 0.5 offsets on the viewport extents.

    The runtime validates the parameters to be in valid range, skipping the call if there is an error (the DDI will never see invalid parameters).

    The behavior of the implicit scissor to the viewport with fractional viewport extents is described in the Scissor(15.7) section (basically rounding X and Y to negative infinity to get integers).

    Observe that when the viewport location is fractional, which results in rounding to determine the implicit scissor, there is effectively a non-deterministic zone up to 1/2 pixel wide along the left and top edges within the scissor area but not covered by the viewport. Because guard-band clipping to viewport extents is optional for implementations, and can vary among implementations that do perform it, rendering results in the non-deterministic zone will be some undefined combination of background values and primitives that may or may not have been clipped off the zone.

    If an application needs to avoid artifacts from this non-deterministic zone, one approach is to simply never use fractional viewport extents. Another approach, if fractional viewports are needed, is to always subtract 1 from the intended viewport TopLeftX and TopLeftY, while adding 1 to the intended Viewport Width and Height, then defining the Scissor extents over the intended pixel area. This will crop out the non-deterministic zone and allow fractional viewports that, for example, smoothly move the inside contents (even though the extents are rounded), without any non-deterministic rendering.

    15.7 Scissor Test

    Scissor cuts out a rectangle in RenderTarget space where pixels are permitted to appear. Any pixel outside these extents is discarded. Multiple Scissor rectangles can be active simultaneously, from which individual primitives can choose one (see Selecting Viewport/Scissor(15.8.1) below). Only one scissor rectangle can ever apply to an individual primitive being rasterized, though this does not count the implied scissoring that is always applied to the Viewport(15.6)'s x/y extents.

    Scissor extents are specified as unsigned integers, with no limits on their magnitudes. If the Scissor rectangle falls off the currently set RenderTargets, then simply nothing will get drawn. If the Scissor rectangle is larger than the currently set RenderTarget(s) or straddles an edge, then the only pixels that can be drawn are the ones in the covered area of the RenderTarget(s). The Scissor can be enabled or disabled (all Scissors together) using the Rasterizer State(15.1) ScissorEnable. If disabled, any pixel on the RenderTarget(s) can be drawn to. The default Scissor Rectangle is an empty Scissor Rectangle: {0,0,0,0}.

    The implicit scissor to the viewport (mentioned in the Viewport(15.6) section) rounds the viewport X and Y extents to negative infinity. This way the scissor extents are always integers. The rounding to derive scissor extents applies to the locations where the fractional left/right/top/bottom edges would be after the float viewport transform. E.g. the viewport width and height cannot be rounded; they must be added to unrounded TopLeftX and TopLeftY to determine the right and bottom extents, which then get rounded to determine the scissor extents.
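
    A non-normative sketch of this derivation; note the right/bottom edges are computed in float32 first, and each edge is then rounded toward negative infinity:

    #include <cmath>  // std::floor

    struct Rect { int left, top, right, bottom; };

    // Width/Height are never rounded on their own; they are added to the
    // unrounded top-left extents before the round.
    Rect ImplicitViewportScissor(float topLeftX, float topLeftY,
                                 float width, float height)
    {
        Rect r;
        r.left   = (int)std::floor(topLeftX);
        r.top    = (int)std::floor(topLeftY);
        r.right  = (int)std::floor(topLeftX + width);  // add, then round
        r.bottom = (int)std::floor(topLeftY + height);
        return r;
    }
    // e.g. TopLeftX = 10.25, Width = 100: left = 10, right = floor(110.25) = 110.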

    15.8 Viewport and Scissor Controls

    15.8.1 Selecting the Viewport/Scissor

    There is a set of 16 Viewports and Scissor rects that can be set active via the API/DDI. By default, the 0-th Viewport and Scissor settings are used during rasterization setup. But Viewports can be selected on a per-primitive basis from the Geometry Shader by naming a component of GS output vertex data "ViewportArrayIndex"(24.5). "ViewportArrayIndex", taken from the Leading Vertex(8.14) for a primitive, is interpreted as a 32-bit unsigned integer value, with meaningful values in the range [0..n-1] (where n is the maximum number of viewports allowed). Values outside [0..n-1] are treated as 0 for indexing viewports. Should the Pixel Shader input "ViewportArrayIndex", whatever value "ViewportArrayIndex" was given shows up unmodified/unclamped in the Shader (even if out of [0..n-1] range).

    If the Geometry Shader is not used, the default 0-th Viewport and Scissor settings are used.

    15.9 Viewport/Scissor State

    typedef struct D3D11_VIEWPORT
    {
        float       TopLeftX;
        float       TopLeftY;     /* Viewport Top left */
        float       Width;
        float       Height;       /* Viewport Dimensions */
        float       MinDepth;         /* Min/max of clip Volume */
        float       MaxDepth;
    } D3D11_VIEWPORT;
    
    typedef struct D3D11_RANGE
    {
       SIZE_T Start;
       SIZE_T End; /* One past end; Size = ( End - Start ) */
    } D3D11_RANGE;
    
    typedef struct D3D11_RECT
    {
       D3D11_RANGE X;
       D3D11_RANGE Y;
    } D3D11_RECT;
    
    typedef struct D3D11_BOX
    {
       D3D11_RANGE X;
       D3D11_RANGE Y;
       D3D11_RANGE Z;
    } D3D11_BOX;
    
    SetViewports(UINT NumViewports, const D3D11_VIEWPORT *pViewports); /* NumViewports: 0 - 16 */
    SetScissorRects(UINT NumRects, const D3D11_RECT *pRects); /* NumRects: 0 - 16 */
    

    15.10 Depth Bias

    Rasterizer State(15.1) defining Depth Biasing:
        INT     DepthBias
        float   SlopeScaledDepthBias
        float   DepthBiasClamp
    
    Formulas:
    
    MaxDepthSlope = max(abs(dz/dx),abs(dz/dy)) // approximation of max depth
                                               // slope for polygon
    
    if( SlopeScaledDepthBias != 0 )
        SlopeScaledDepthBias = SlopeScaledDepthBias * MaxDepthSlope;
        // Above: only doing SlopeScaledDepthBias math when nonzero to avoid
        // a 0*INF = NaN scenario with edge-on wireframe triangles.
        // Previously in the D3D10 spec, hardware was erroneously spec'd to
        // unconditionally multiply SlopeScaledDepthBias with MaxDepthSlope.
        // The new behavior defined here applies to any new hardware regardless
        // of what D3D API or feature level it is running against.
    
    When UNORM Depth Buffer is at Output Merger (or no Depth Buffer):
        Bias = (float)DepthBias * r + SlopeScaledDepthBias
    
        Where r is the minimum representable value > 0 in the depth buffer
        format, converted to float32.
    
    When Floating Point Depth Buffer at Output Merger:
        Bias = (float)DepthBias * 2^(exponent(max abs(z) in primitive) - r) +
                SlopeScaledDepthBias
    
        Where r is the # of mantissa bits in the floating point representation
        (excluding the hidden bit), e.g. 23 for float32.
    
    Adding Bias to z:
    
    if(DepthBiasClamp > 0)
        Bias = min(DepthBiasClamp, Bias)
    else if(DepthBiasClamp < 0)
        Bias = max(DepthBiasClamp, Bias)
    // else if DepthBiasClamp == 0, no clamping occurs
    
    if ( (DepthBias != 0) || (SlopeScaledDepthBias != 0.) )
        z = z + Bias
    

    Biasing is constant for a given primitive, with the same value added to the z for each vertex before interpolator setup.

    The biasing formulas are performed with float32 arithmetic.
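
    A non-normative float32 sketch of the formulas above; the caller supplies r for the UNORM case (e.g. 1.0f/((1<<24)-1) for a 24-bit UNORM depth buffer) and max abs(z) over the primitive for the floating point case:

    #include <cmath>  // std::frexp, std::ldexp, std::fmin, std::fmax

    // unormR: minimum representable value > 0 of the UNORM depth format;
    // maxAbsZ: max abs(z) over the primitive (float depth buffer case only).
    float ComputeDepthBias(int depthBias, float slopeScaledDepthBias,
                           float depthBiasClamp, float maxDepthSlope,
                           bool floatDepthBuffer, float unormR, float maxAbsZ)
    {
        float slopeTerm = 0.0f;
        if (slopeScaledDepthBias != 0.0f)          // avoid 0 * INF = NaN
            slopeTerm = slopeScaledDepthBias * maxDepthSlope;

        float bias;
        if (floatDepthBuffer)
        {
            int e;
            std::frexp(maxAbsZ, &e);               // maxAbsZ = m * 2^e, m in [0.5,1)
            // 2^(ieeeExponent - 23), where ieeeExponent = e - 1 for float32
            bias = (float)depthBias * std::ldexp(1.0f, e - 1 - 23) + slopeTerm;
        }
        else
        {
            bias = (float)depthBias * unormR + slopeTerm;
        }

        if (depthBiasClamp > 0.0f)      bias = std::fmin(depthBiasClamp, bias);
        else if (depthBiasClamp < 0.0f) bias = std::fmax(depthBiasClamp, bias);
        return bias;
    }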

    Depth Bias is not applied to any point or line primitives, except for lines drawn in wireframe mode as described in the Fill Modes(15.13) section.

    Depth Bias is disabled by setting both DepthBias and SlopeScaledDepthBias to zero, in which case the depth value is unmodified. Note that this disables propagation of IEEE specials that may be generated if the operation is performed even with zero DepthBias and SlopeScaledDepthBias values.

    Comments on one of the usage scenarios for Depth Biasing:

    One of the artifacts with shadow buffer based shadows is “shadow acne”, or a surface shadowing itself in a spotty way because of inexactness in computing the depth of a surface from the shader to be compared against the depth of the same surface in the shadow buffer. A way to alleviate this is to use DepthBias and SlopeScaledDepthBias when rendering a shadow buffer. The intent is to push surfaces out enough when rendering a shadow buffer so that when compared against themselves via shader-computed z during the shadow test, the comparison result is consistent across the surface, and local-self-shadowing is avoided.

    However, using DepthBias and SlopeScaledDepthBias alone introduces a few of its own artifacts, where an extremely steep polygon causes the bias equation to explode, pushing the polygon extremely far away from the originating surface in the shadow map. Consider a steep face, with respect to a light, that gets pushed away extremely far in relation to the dimensions of the parent object by Depth Biasing. Suppose this face is surrounded by shallower faces which the Bias equation pushed out much, much less. The resulting shadow map has a huge discont