0052 - Experimental DXIL Ops
| Status | Accepted |
|---|---|
| Authors | |
| Sponsor | |
- Planned Version: SM 6.10
Introduction
This proposal introduces a method for denoting and tracking experimental DXIL operations that minimizes churn when an operation is rejected or delayed to a later DXIL version.
Motivation
During iterative development of the shader compiler it is beneficial to implement real lowering into real opcodes to validate that a proposal actually solves real world use cases. Traditionally this has been done by adding new opcodes right after the last released opcode in the prior DXIL version. In some cases this is sufficient, when feature development is unified, opcodes don’t change after being added, and all opcodes in a contiguous block starting from the prior release are accepted into the next release. But challenges arise during parallel feature development, from experimental feature evolution requiring opcode changes, or when a feature and its opcodes are excluded from the release while the opcodes following them are not. Excluded opcodes must either be turned into reserved opcodes or a breaking DXIL change must be synchronized between the compiler, tests, and drivers. This proposal seeks to implement a systematic method to handle these issues.
Goals
This proposal seeks to address the following points:
- Needless churn when experimental ops are delayed or rejected
- Experimental feature boundaries are rigid and unaffected by SM updates
- Enable long term experiments
- Focused on the core API system (HLSL intrinsics and DXIL ops)
- Works within the current intrinsics/DXIL op mechanisms
- Minimizes overall changes to the system and IHV processes
- Straightforward transition route from experimental to stable
- Soft transitions between versions of experimental ops and final ops simplify migrations
- IHV drivers can support multiple experimental versions and the final version of a set of ops in the same driver
Non-goals
Future proposals may address the topics below, but this proposal seeks to be a smaller, isolated change. It intends to solve immediate-term challenges without investing significant engineering effort into a generalized solution. That said, an attempt is made to avoid proposals that preclude a generalized solution. Thus this proposal explicitly avoids addressing these issues:
- Full scale generalized extension system
- Process development to enable asynchronous non-colliding development
- Metadata/RDAT/PSV0/Custom lowering are out of scope for this document
Accepted Solution
The top 16 bits of the opcode shall be used to partition the opcode into ~64k
partitions each with ~64k opcodes. The top 16 bits of the opcode are called
the FeatureID and the only valid FeatureIDs are 0x0000 and 0x8000.
Within a given partition, opcodes must be contiguous. When opcodes are retired
or transitioned between FeatureIDs, a reserved opcode must be inserted in their
place to prevent the introduction of a hole. Once an opcode has been reserved
for at least one shader model version it may be reused, but this is discouraged.
    0b 0000 0000 0000 0000 0000 0000 0000 0000
                           ^^^^ ^^^^ ^^^^ ^^^^ opcode space
       ^^^^ ^^^^ ^^^^ ^^^^-------------------- opcode partition / FeatureID
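The layout above can be illustrated with a minimal Python sketch that packs and unpacks an opcode. The helper names `make_opcode` and `split_opcode` are hypothetical and exist only for this illustration; they are not part of DXC or Clang.

```python
from typing import Tuple

STABLE_FEATURE_ID = 0x0000        # stable opcodes released in the retail compiler
EXPERIMENTAL_FEATURE_ID = 0x8000  # opcodes not yet finalized

def make_opcode(feature_id: int, index: int) -> int:
    """Pack a 16-bit FeatureID and a 16-bit index into one 32-bit opcode."""
    if not 0 <= feature_id <= 0xFFFF:
        raise ValueError("FeatureID must fit in 16 bits")
    if not 0 <= index <= 0xFFFF:
        raise ValueError("opcode index must fit in 16 bits")
    return (feature_id << 16) | index

def split_opcode(opcode: int) -> Tuple[int, int]:
    """Return (FeatureID, index within the partition) for a 32-bit opcode."""
    return (opcode >> 16) & 0xFFFF, opcode & 0xFFFF

# A stable opcode keeps its existing numeric value.
print(make_opcode(STABLE_FEATURE_ID, 12))               # 12
# The same index in the experimental partition has the top bit set.
print(hex(make_opcode(EXPERIMENTAL_FEATURE_ID, 12)))    # 0x8000000c
```

Because the FeatureID occupies the top bits, a stable opcode is numerically identical to its index, which is what keeps existing opcodes from being renumbered.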
FeatureID 0x0000 must be used for stable opcodes released in the retail
compiler. Any other value would break backward compatibility.
FeatureID 0x8000 is used for all new opcodes not yet finalized for a shader
model release.
These opcodes should be added (assigned) and merged into main as early as
possible, even before they are used or tested. This reserves the opcode so
subsequent code and tests may avoid opcode collisions with other features.
During development, if an opcode needs to be changed in any breaking way from a version supported by an experimental driver, the opcode should be left behind and renamed as necessary to avoid a collision with the new version. Then a new opcode should be used to allow an easier transition for experimental development, allowing an experimental driver to support both versions for a period of time.
When an opcode with FeatureID 0x8000 is finalized for the next DXIL release,
it should be copied to FeatureID 0x0000 and assigned the next available
opcode number.
After the opcode is copied to a final DXIL opcode, the original opcode with
FeatureID 0x8000 may be kept and renamed for transition compatibility for
drivers, or it may be replaced with a reserved op to avoid reusing the opcode
for at least one DXIL release.
The opcode spaces under each FeatureID are independent and no correlation in
the underlying value may be assumed.
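The lifecycle described above (reserve early, then copy to the stable partition and leave a reserved or renamed entry behind) might be modeled as follows. This is only a sketch: the `OpTables` class, the `_Legacy` suffix, and the op names are hypothetical and do not reflect how hctdb.py or DXIL.td actually store opcodes.

```python
from dataclasses import dataclass, field
from typing import List

RESERVED = "Reserved"  # placeholder name for a retired slot (assumption)

@dataclass
class OpTables:
    stable: List[str] = field(default_factory=list)        # FeatureID 0x0000
    experimental: List[str] = field(default_factory=list)  # FeatureID 0x8000

    def add_experimental(self, name: str) -> int:
        """Assign the next contiguous experimental opcode."""
        self.experimental.append(name)
        return (0x8000 << 16) | (len(self.experimental) - 1)

    def finalize(self, name: str, keep_alias: bool = False) -> int:
        """Copy an experimental op to the next available stable opcode.

        The experimental slot is never removed, so the partition stays
        contiguous: it is either renamed for transition compatibility
        or replaced with a reserved entry.
        """
        index = self.experimental.index(name)
        self.stable.append(name)
        self.experimental[index] = f"{name}_Legacy" if keep_alias else RESERVED
        return len(self.stable) - 1

tables = OpTables(stable=["StableOp0", "StableOp1"])
foo = tables.add_experimental("Foo")
bar = tables.add_experimental("Bar")
new_foo = tables.finalize("Foo")
print(hex(foo), hex(bar), new_foo)   # 0x80000000 0x80000001 2
print(tables.experimental)           # ['Reserved', 'Bar']
```

Note that the stable value (2) bears no relation to the experimental index (0), matching the rule that no correlation between partitions may be assumed.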
When an experimental FeatureID is used the entire shader must be marked as preview.
Decision details
The leading proposals at the time of writing were:
- Top 1 bit as experimental flag
- Top 16 bits as opcode partition
Both proposals use some number of bits from the top portion of the opcode to partition the remaining bits into separate opcode sets. In DXC these sets map directly to tables of operations, while in clang individual opcode values are arbitrarily set very late in lowering.
The team saw notable merits and issues with both solutions such that a trivial decision was not possible. Those details are discussed in their respective “Alternatives Considered” sections.
Implementation complexity is within the same magnitude for both proposals. However, the key debate between the proposals is that of process complexity. The 16bit proposal introduces a reasonable amount of process management for reserving and retiring feature IDs, which may have limited utility given the remaining DXIL lifecycle. There are also process costs to maintain the list of experimental feature IDs. Conversely, while the 1bit proposal lacks process complexity, it also lacks flexibility and resolution.
The solution proposed is to implement the Top 16 bits proposal with the following restriction:
- Feature IDs must be either 0x0000 or 0x8000 until an undetermined time in the future, when the restriction may be lifted if desired.
Feature ID 0x0000 must be used for stable opcodes to ensure that existing
opcodes are not renumbered. The choice of ID 0x8000 may seem arbitrary, but
it is selected for one key property: its only set bit is the topmost bit of
the 32-bit opcode. Therefore all opcodes defined with either ID will
match the underlying values they would have in the Top 1 bit proposal.
With this restriction, the compiled DXIL is proposal agnostic. Either original proposal could have reasonably generated it. This is an imperfect compromise that has two key features:
- it unblocks high priority dependencies of the feature
- it cleanly collapses into either original proposal
It is the intent of this proposal to serve as an experiment in its own right. At some point in the future this proposal should collapse into one of the above proposals in a non-breaking manner. After iterating through a number of development cycles, the level of complexity required to serve this feature will reveal itself. If the development challenges presented at the top of this document are resolved with the restrictions in place then the proposal will collapse into the 1bit solution and no further work is needed. However, if the challenges persist then clearly the restrictions are too constraining. If this occurs then a new proposal shall be made to lift the restrictions allowing for any feature ID and addressing any required process changes which will collapse into the 16bit proposal.
Implementation Considerations
DXC
DXC has considerable constraints to align implementation with existing infrastructure. All of the bit width proposals are reasonable to implement but the other proposals are beyond reasonable scope. For the bit width proposals the infrastructure needs to be updated to support multiple op code tables. The detailed explanation of that infrastructure is listed below.
Clang
The Clang implementation for the bit width proposals is trivial. Clang has a late
lowering for converting high level notions of the DXIL ops into the specific
opcode values. That mapping is arbitrary and can be implemented by simply
setting the correct value for the opcode in llvm/lib/Target/DirectX/DXIL.td.
For readability, the opcodes can be written in hex instead of the traditional
decimal value. As an example, the 12th experimental opcode could be written as
0x8000000C or 0x80000000 | 12 instead of 2147483660.
DXV
The DXIL Validator should be updated in response to the changes proposed
above. The smallest change with notable impact would be to detect the use of
an experimental opcode (opcode & 0x80000000) and automatically set the
preview hash. A more robust change would be to add a DXIL flag that allows
experimental opcodes, set by a --hlsl-experimental flag on the compiler. The
compiler flag would set the DXIL flag, and the validator would then report an
error if an experimental opcode is used without the flag set.
There may be changes to IR lowering in experimental compilers that don’t result in the emission of experimental opcodes. This means that the automatic check may miss some experimental uses.
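The more robust validator option can be sketched as below. This is a minimal Python illustration, not validator code: `check_module`, its parameters, and the error text are hypothetical, and the module is reduced to a plain list of opcode values.

```python
from typing import Iterable, List

EXPERIMENTAL_BIT = 0x80000000  # top bit of the 32-bit opcode

def is_experimental(opcode: int) -> bool:
    """True when the opcode lives in the experimental partition."""
    return (opcode & EXPERIMENTAL_BIT) != 0

def check_module(opcodes: Iterable[int], experimental_allowed: bool) -> List[str]:
    """Return one error per experimental opcode used without the flag."""
    errors = []
    for opcode in opcodes:
        if is_experimental(opcode) and not experimental_allowed:
            errors.append(
                f"experimental opcode 0x{opcode:08X} used without the experimental flag"
            )
    return errors

# A stable opcode plus the 12th experimental opcode, flag not set.
for message in check_module([1, 0x8000000C], experimental_allowed=False):
    print(message)
```

As the text notes, a check of this kind only sees emitted opcodes, so experimental lowering changes that produce no experimental opcode would pass unnoticed.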
Existing DXIL Op and HLSL Intrinsic Infrastructure
The details below are specific to DXC; clang has a significantly different infrastructure. Because the clang infrastructure selects a specific op value through an arbitrary mapping, less prose is required to highlight its limitations.
In DXC, there exists a large amount of infrastructure for handling DXIL ops as special types of functions throughout the compiler. From definition to lowering to passes to validation and consumption, any solution that doesn’t fit into this system will face significant challenges from development through to the transition of operations from experimental to final official DXIL ops in a shader model, both in the compiler and in drivers consuming the ops.
There is also a high-level intrinsic system which uses its own set of opcodes
in the generated enum: hlsl::IntrinsicOp. Though these are internal to the DXC
compilation pipeline, stability of these opcodes impacts any tests with
high-level IR, such as tests for lowering.
This section outlines key areas of this system for clarity in reasoning about solutions.
DXIL Op Definitions
DXIL Ops are defined in hctdb.py, which is used by hctdb_instrhelp.py to
generate header and cpp code used directly by drivers to consume the operations,
as well as generate a variety of other code for the compiler, validator, DXIL
spec, etc…
DxilOperations.h/cpp implements the core of the system for handling DXIL
operations in a DxilModule.
DXIL OpCodes, which are always passed as a literal in the first argument of a DXIL operation call, are a contiguous set of values starting at 0, such that they may be used to directly index a table of opcode definitions at the core of this infrastructure. This OpCode argument in the DXIL Op call is the sole identifier of the operation being called. Function names reflect OpCodeClass and overloads, but this is only a means to prevent collisions between functions used by operations requiring different signatures and attributes.
The contiguous nature of DXIL OpCodes used to index into a table is the first key hurdle in defining experimental ops. If an operation at a particular index is changed in any significant way, the interpretation of IR across that change boundary produces undefined behavior (crash if you’re lucky), with no automatic mechanism to guard against this.
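One way a single contiguous table could be extended to one table per partition is sketched below in Python. The table contents and the names `OpProps`, `OP_TABLES`, and `lookup` are placeholders invented for this illustration; the real tables are generated from hctdb.py and hold far more properties.

```python
from typing import Dict, List, NamedTuple

class OpProps(NamedTuple):
    name: str
    op_class: str

# One contiguous table per FeatureID (placeholder entries).
OP_TABLES: Dict[int, List[OpProps]] = {
    0x0000: [OpProps("StableOp0", "classA"), OpProps("StableOp1", "classB")],
    0x8000: [OpProps("ExperimentalOp0", "classC")],
}

def lookup(opcode: int) -> OpProps:
    """Select the table by FeatureID, then index it with the low 16 bits."""
    feature_id, index = (opcode >> 16) & 0xFFFF, opcode & 0xFFFF
    table = OP_TABLES.get(feature_id)
    if table is None:
        raise KeyError(f"unknown FeatureID 0x{feature_id:04X}")
    if index >= len(table):
        raise IndexError(f"opcode index {index} out of range for 0x{feature_id:04X}")
    return table[index]

print(lookup(0x00000001).name)  # StableOp1
print(lookup(0x80000000).name)  # ExperimentalOp0
```

The explicit range checks turn an out-of-table opcode into a reported error instead of an out-of-bounds read, but they cannot detect an entry whose meaning changed in place, which is the hazard described above.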
HLSL IntrinsicOp definitions
Intrinsic operations are normally defined in gen_intrin_main.txt, which is
parsed by hctdb.py and used by hctdb_instrhelp.py to generate the
hlsl::IntrinsicOp enum, and a bunch of tables used by custom intrinsic
overload handling code in Sema.cpp.
There is infrastructure that tracks previously assigned HL op indices by
intrinsic name in hlsl_intrinsic_opcodes.json. This can be a merge conflict
point between any parallel feature development.
While indices are separated between functions and methods, all functions or
all methods with the same name will share the same HL opcode. Generally this
isn’t a problem as the arguments (which would include an object) allow you to
differentiate things when handling opcode calls. Recently a class_prefix
attribute was added to the intrinsic definition syntax for gen_intrin_main.txt
to prepend a class name, used for DxHitObject. This is just an example of
how this system can be extended to separate out ops if necessary.
HLOperationLower.cpp uses a direct table lookup from the (unsigned)
IntrinsicOp value to the lowering function and arguments. This creates
another merge point for any experimental features (and potentially extensions)
that integrate into the same intrinsic table.
There is an extension mechanism defined through the IDxcLangExtensions
interface on the DXC compiler API object. It allows you to define a separate
intrinsic table with predefined lowering strategies to produce extended ops as
external function calls outside the recognized DXIL operations. It’s meant to
enable target extensions (extra intrinsics within certain limited definitional
bounds) in HLSL for a custom backend. Modules using extensions wouldn’t be
accepted by the DXIL validator (unmodified). The way extensions must be defined,
used, and interpreted differs significantly from adding built-in HLSL
intrinsics and DXIL operations, which means it will introduce significant
burdens and limitations to initial op definitions, lowering and compiler
interaction, and make the transition to final DXIL operations painful. For
these reasons, I don’t think we should consider this extension mechanism as
part of our solution at this time.
While this document focuses on a solution for DXIL ops, the HL opcodes can
lead to difficult conflicts between independent feature development branches
as well. Avoiding these requires synchronizing hlsl_intrinsic_opcodes.json
and pre-allocated lowering table entries in HLOperationLower.cpp in a common
branch as a very first step whenever adding any new HLSL intrinsics.
IR Tests
Tests that contain DXIL will have DXIL operation calls passing a literal i32
OpCode value as the first argument. If these opcodes are to change
between experimental and final versions, there should be an easy way to update
the tests accordingly. The same applies to the IntrinsicOp numbers in any high-level IR.
There are two places where hard-coded numbers appear in tests: source IR and FileCheck statements for checking output IR.
There isn’t any known solution that doesn’t involve a change to at least the DXIL OpCodes when transitioning from experimental to final DXIL ops.
That requires either updating these across all tests (potentially with scripted regex replacement - matching could be error-prone) or adding some tool (or tool option) to translate symbolic opcodes to literal numbers as a first step.
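The scripted replacement option might look like the following Python sketch. It is an assumption-laden illustration, not an existing tool: `renumber_opcodes` is a hypothetical helper, and the regular expression only matches the literal first argument of calls to functions whose names begin with `@dx.op.`. LLVM textual IR generally prints i32 constants as signed decimals, so an experimental opcode such as 0x8000000C may appear as -2147483636; the sketch accepts both spellings.

```python
import re
from typing import Dict

# Matches the opcode literal in e.g. "@dx.op.unary.f32(i32 13, ..."
_DXOP_CALL = re.compile(r"(@dx\.op\.[\w.]+\(i32 )(-?\d+)\b")

def _as_ir_literal(value: int) -> str:
    """Format a 32-bit value the way a signed i32 constant is printed."""
    return str(value - (1 << 32) if value & 0x80000000 else value)

def renumber_opcodes(ir_text: str, mapping: Dict[int, int]) -> str:
    """Rewrite opcode literals in dx.op calls according to `mapping`.

    Keys and values are unsigned 32-bit opcode values. Literals outside
    dx.op calls and opcodes missing from the mapping are left untouched.
    """
    def replace(match: "re.Match[str]") -> str:
        old = int(match.group(2)) & 0xFFFFFFFF  # normalize signed spelling
        if old not in mapping:
            return match.group(0)
        return match.group(1) + _as_ir_literal(mapping[old])
    return _DXOP_CALL.sub(replace, ir_text)

source = "%r = call float @dx.op.unary.f32(i32 -2147483636, float %x)"
print(renumber_opcodes(source, {0x8000000C: 300}))
```

Anchoring the pattern to the call site reduces the risk of rewriting unrelated constants, though FileCheck lines that use regex captures for the opcode would still need manual attention.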
Summary of key elements a solution should address
- DXIL Op property table indexed by OpCode
- HLOperationLower table indexed by IntrinsicOp
- A way to update and deprecate experimental opcodes during development without a new opcode overlapping an old one, leading to undefined behavior in a driver if mismatched IR is used.
- A way for the same driver to accept multiple versions of ops without undefined behavior.
- A way to easily transition tests from experimental ops to final DXIL ops
- Potentially: A way to avoid some of the more difficult HL opcode conflicts between independent feature development branches
- Minimal, or ideally no, changes required to source code interacting with or consuming DXIL ops when transitioning from experimental to final ops.
Alternative DXIL Op Solutions Considered
Top 1 bit as experimental flag
The top bit of all opcodes is a flag stating if the opcode is experimental.
    0b 0000 0000 0000 0000 0000 0000 0000 0000
        ^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^ opcode space
       ^-------------------------------------- opcode partition
| Partition | Use |
|---|---|
| 0 | stable |
| 1 | experimental |
No structural or shape changes to the DXIL opcode occur, the fact that the opcode has the high bit set informs that it is experimental. This makes it very easy for the validator and drivers to detect experimental opcodes.
This splits the 4 billion opcode space into two partitions of 2 billion opcodes each: one for stable and one for experimental. The proposal results in two separately contiguous opcode tables.
When an opcode is transitioned to stable it must be moved to the stable opcode partition. These two tables are completely independent from each other so opcode transition will result in a complete renumbering. No assumption may be made about how the opcode number will change when moving to stable.
As opcodes are moved from experimental to stable they will introduce holes in the experimental opcode partition. Depending on the feature, the experimental opcode may be retained for long term experiments or it may be changed to reserved. Once an opcode has been reserved for an entire shader model lifecycle, it may be recycled for future use. Ex: An opcode introduced in 6.8 and marked reserved in 6.9 may then be reused in 6.10.
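The recycling rule in the example above can be expressed as a small check. This Python sketch uses hypothetical helper names (`parse_sm`, `may_recycle`) and assumes shader model versions are written as `major.minor` strings.

```python
from typing import Tuple

def parse_sm(text: str) -> Tuple[int, int]:
    """Parse 'major.minor' into a tuple so that 6.10 sorts after 6.9."""
    major, minor = text.split(".")
    return int(major), int(minor)

def may_recycle(reserved_in: str, target: str) -> bool:
    """An opcode reserved for all of one shader model may be reused in a later one."""
    return parse_sm(target) > parse_sm(reserved_in)

print(may_recycle("6.9", "6.10"))  # True: reserved throughout 6.9
print(may_recycle("6.9", "6.9"))   # False: still inside the reserved release
```

Comparing parsed tuples rather than raw strings matters here, because the string "6.10" sorts before "6.9".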
This is the simplest proposal with the least invasive set of changes, though it is only marginally simpler than the other reserved bit proposals.
Pros:
- Fairly simple
- Quick to implement
- Could be implemented “by hand” today by hard coding opcodes in clang, DXC requires some updates to opcode generation code.
Cons:
- Not a solution for extensions
- Transition from experimental -> stable requires manual renumbering which will change the lowering.
Top 8 bits as opcode partition
This is pretty much identical to the 1 bit flag proposal except there are 256 partitions with 16 million opcodes each. The key difference is that it unlocks extension potential as extension developers such as IHVs could reserve a partition for their own use without collision with other opcodes.
| Partition | Use |
|---|---|
| 0 | stable |
| 1 | experimental |
| 2 | extension foo |
| .. | extension .. |
| 255 | extension 255 |
Pros:
- Fairly simple
- Quick to implement
- Could be implemented “by hand” today by hard coding opcodes in clang, DXC requires some updates to opcode generation code.
- Enables basic opcode extension system
Cons:
- Transition from experimental -> stable requires manual renumbering which will change the lowering.
Top 16 bits as opcode partition
Identical concept as the 1 bit proposal with a couple key changes.
    0b 0000 0000 0000 0000 0000 0000 0000 0000
                           ^^^^ ^^^^ ^^^^ ^^^^ opcode space
       ^^^^ ^^^^ ^^^^ ^^^^-------------------- opcode partition
This will create ~64k partitions each with ~64k opcodes.
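The partition sizes quoted for the 1, 8, and 16 bit proposals follow from the same arithmetic, shown here as a short Python sketch. The function name `partition_layout` is hypothetical.

```python
from typing import Tuple

def partition_layout(partition_bits: int, total_bits: int = 32) -> Tuple[int, int]:
    """Return (number of partitions, opcodes per partition)."""
    if not 0 < partition_bits < total_bits:
        raise ValueError("partition_bits must be between 1 and total_bits - 1")
    partitions = 1 << partition_bits
    opcodes_per_partition = 1 << (total_bits - partition_bits)
    return partitions, opcodes_per_partition

for bits in (1, 8, 16):
    partitions, per_partition = partition_layout(bits)
    print(f"top {bits:>2} bits: {partitions} partitions x {per_partition} opcodes")
# top  1 bits: 2 partitions x 2147483648 opcodes
# top  8 bits: 256 partitions x 16777216 opcodes
# top 16 bits: 65536 partitions x 65536 opcodes
```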
The opcode partition is now large enough that it earns a special name
FeatureID. Along with this name change, the intended usage also changes.
Each new feature in development reserves a new FeatureID as the first step
in the development lifecycle. This enables async work as coordination is only
required at the FeatureID scope which can be atomically reserved. The one
exception is FeatureID 0x0000 which is reserved for past stable opcodes to
maintain their current value.
The validator or some other mechanism must maintain a list of experimental FeatureIDs. When an experimental FeatureID is used the entire shader must be marked as preview.
FeatureIDs may be reused once a FeatureID has been marked reserved for at least
one shader model. Ex: FeatureID 0xBEEF is introduced in 6.8 as
experimental and marked as reserved for 6.9, so it may be recycled in 6.10.
Pros:
- Could be implemented “by hand” today by hard coding opcodes in clang, DXC requires some updates to opcode generation code.
- Enables basic opcodes extension system
- Very flexible for opcode assignments
- Decreases the occurrence of opcode collision merge conflicts
Cons:
- Transition from experimental -> stable may require manual renumbering if opcodes are moved to the stable space
- More complex process for managing all the FeatureIDs and maintaining experimental lists
- The remaining lifetime of DXIL makes it difficult to justify the complexity of the feature when there is little reason to believe it’ll be used in any real capacity
Split the opcode in half
Not a real/reasonable proposal, presented as a technically possible thing.
Lower 16 bits are the core/stable opcodes, Upper 16 bits are the experimental opcodes.
Gives 64k opcodes for stable. The upper 64k can then either be chunked manually, leaving all numbers available for opcodes, or be partitioned as 256 chunks of 256 opcodes with the partition encoded into the opcode itself.
Very similar concept as before but keeping track of opcodes is complicated. Also enables a weird situation where two opcodes “could” be encoded into a single value.
Introduce dx.opx.opcodeclass for experimental/extended ops
Denotes the experimental status in the actual opcode. Potentially doubles the opcode space depending on implementation; however, it doesn’t make the transition to stable any easier and complicates the integration with the current intrinsics system.
Pros:
- Enables fairly robust extension system
- Doesn’t consume large portions of the current opcode space
- Obvious from reading the DXIL that experimental/extension is being used
Cons:
- Transition from experimental to stable isn’t just dropping the x
  - other stable ops may have already taken that number
- complicates the experimental->stable mapping
- Not well integrated into the current system, would require notable dev work
- Unclear how to allocate extension vs experimental ops in the opx space
- Causes issues for drivers as drivers don’t currently consider the op code class
Extension/Experimental Feature Opcode
Relaxing the restriction that DXIL opcodes are immediate constants would allow a call that returns a value representing a special operation. The call creates the value from a feature ID and a feature-local opcode. Uniquifying information could be stored in the call directly or in metadata.
%feature_id = i32 123
%cool_operation = i32 456
%opcode = i32 dx.create.extensionop(%feature_id, %cool_operation)
%result = i32 dx.op.binary(%opcode, %a, %b)
Pros:
- Enables very robust extension system
- Doesn’t consume any of the current opcode space
- Obvious from reading the DXIL that experimental/extension is being used
Cons:
- Transition from experimental to stable is non-trivial. See “Stabilizing with opcode subsets”.
- Not integrated into the current system, would require notable dev work
- Breaks a pretty fundamental DXIL assumption
Single Specific Experimental Opcode with varargs
A new opcode class dx.op.extension is introduced as a core stable opcode in
which named opcode subsets can be called directly.
%opcode_set = str "My Cool Experiment"
%opcode = i32 123
%res = i32 dx.op.extension(i32 12345, %opcode_set, %opcode, operands...)
The opcode set name and specific opcode are just arbitrary values from other parts of the compiled shader.
Pros:
- Doesn’t consume any of the current opcode space
- Obvious from reading the DXIL that experimental/extension is being used
- Very flexible
- Maintains first args as immediate constant
- All the information is encoded in the call
Cons:
- Transition from experimental to stable is non-trivial. See “Stabilizing with opcode subsets”.
- Unclear how well the current system will handle varargs
- More complex to implement and integrate
- dx.op.extension will need to support any arbitrary overload
Stabilizing with opcode subsets
Some proposals in this doc create new opcode sets that reuse existing numbers nested under a set name or feature ID. These proposals have a more complex route for transitioning from experimental to stable. There are two potential routes to be considered.
- Create a new stable opcode from scratch using the normal mechanisms that currently exist then migrate lowering paths to use it
- Maintain a notion of experimental and non-experimental opcode subsets then update the specific subset to no longer be considered experimental keeping all lowering the same
The first option has a larger churn burden but maintains the status quo and keeps the generated code relatively dense. The second option is likely the easiest transition system from any proposal in this document, at the cost of code density and of introducing a second way for stable operations to exist in DXIL.
Potential HLSL Intrinsic Solutions
There are two types of intrinsic solutions that can be imagined: one where an extension author provides external code that has a custom lowering to an arbitrary extension DXIL op, and one that is prebaked into the compiler and conditionally enabled/disabled as appropriate.
As HLSL intrinsics are more flexible and can be reordered/renamed without burning some finite resource, only the second type is being considered at the moment. The first type falls under “general purpose extension system”, which is out of scope for this document.
Intrinsic functions should be handled in a reasonable way. Ideally this means that an intrinsic is only available if the experimental/extension op is also available. Likely this means updating gen_intrin_main to mark an intrinsic as experimental/extension, then generating code that errors if it is used in a non-experimental/non-extension environment. But that is subject to change based on the DXIL solution chosen. Once a proposal is selected this section will be updated to reflect that.
Outstanding Questions
- Should DXC have some kind of --experimental flag that turns on/off experimental intrinsics and DXIL ops?
- Related, when/how are experimental ops exposed in the compiler, when are they errors to use?
- Should the validator warn on experimental op usage?